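I have two DataFrames containing expected and actual order details.
Input data:
I want to add a label field to both DataFrames and split rows according to the following rules:
- Sort by country, product, and date.
- Group each DataFrame by country and product.
- Within each group, if a row's date and quantity match in both DataFrames, assign the label "same actual date" / "same expected date".
- If the quantity matches but the dates differ, assign the labels (earlier expected date / later expected date) and (earlier actual date / later actual date).
- If the quantity does not match exactly but the other DataFrame still has quantity left in that group, split the row from the DataFrame with the larger quantity into two rows: the matched (smaller) quantity and the remainder.
- Repeat these steps until every row has a label.
- If no quantity remains in the other DataFrame's group, assign the label "no actual date" or "no expected date".
Expected output:
I am trying to do this with nested loops, but it is very slow for millions of rows.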
for key, exp in expected_grouped:
    act = actual_grouped.get_group(key)
    ...
    for i, outerrow in enumerate(exp.itertuples()):
        for j, innerrow in enumerate(act.itertuples()):
            if: ...
            elif: ...
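Is there a better, faster way to do this? Any suggestions for improvement would be greatly appreciated.
Best answer
You can use the label_orders function below: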
import pandas as pd
import numpy as np
from typing import Tuple

def label_orders(
    expected_orders: pd.DataFrame,
    actual_orders: pd.DataFrame,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Quantities must be strictly positive so that cumulative sums increase.
    if not (actual_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")
    if not (expected_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")
    # Split both frames into aligned segments, one row per segment.
    orders_matched = match_orders(expected_orders, actual_orders)
    del expected_orders, actual_orders
    # Label each segment by comparing its actual and expected dates.
    orders_matched["label"] = label_matched_orders(orders_matched)
    expected_labeled = filter_and_relabel_expected(orders_matched)
    actual_labeled = filter_and_relabel_actual(orders_matched)
    return expected_labeled, actual_labeled

def match_orders(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    keys = ["country", "product"]
    # sort_values returns new frames, so the caller's inputs are not mutated.
    expected = expected.sort_values(by=keys + ["expecteddate"])
    actual = actual.sort_values(by=keys + ["actualdate"])
    # Running totals mark where each order ends on the group's quantity axis.
    expected["cumulative_qty"] = expected.groupby(keys)["qty"].cumsum()
    expected = expected.drop(columns="qty")
    actual["cumulative_qty"] = actual.groupby(keys)["qty"].cumsum()
    actual = actual.drop(columns="qty")
    # The union of both sets of end points gives the split positions.
    orders_matched = pd.merge_ordered(
        actual,
        expected,
        on=keys + ["cumulative_qty"],
        how="outer",
    )
    orders_matched = orders_matched.sort_values(by=keys + ["cumulative_qty"])
    # Segment length is the gap to the previous end point in the same group;
    # the first segment of a group starts at 0.
    previous_end = orders_matched.groupby(keys)["cumulative_qty"].shift(1)
    orders_matched["qty"] = (
        orders_matched["cumulative_qty"] - previous_end.fillna(0)
    ).astype(int)
    orders_matched = orders_matched.drop(columns="cumulative_qty")
    # A segment belongs to the next order that ends at or after it (backfill).
    orders_matched["actualdate"] = orders_matched.groupby(keys)["actualdate"].bfill()
    orders_matched["expecteddate"] = orders_matched.groupby(keys)[
        "expecteddate"
    ].bfill()
    return orders_matched

def label_matched_orders(orders_matched: pd.DataFrame) -> pd.Series:
    labels = pd.Series(index=orders_matched.index, name="label", data="")
    # A segment covered by only one side has a missing date on the other side.
    labels.loc[orders_matched["actualdate"].isna()] = "no actual date"
    labels.loc[orders_matched["expecteddate"].isna()] = "no expected date"
    # Compare dates only where both the actual and the expected date exist.
    both_dates_present_mask = (
        orders_matched["actualdate"].notna()
        & orders_matched["expecteddate"].notna()
    )
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] < orders_matched["expecteddate"])
    ] = "actual before expected"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] == orders_matched["expecteddate"])
    ] = "same date"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] > orders_matched["expecteddate"])
    ] = "expected before actual"
    return labels

def filter_and_relabel_actual(orders_matched: pd.DataFrame) -> pd.DataFrame:
    actual_labeled = orders_matched.loc[
        ~orders_matched["actualdate"].isna(),
        ["country", "product", "actualdate", "qty", "label"],
    ].copy()
    actual_labeled["label"] = actual_labeled["label"].map(
        {
            "actual before expected": "later expected date",
            "same date": "same expected date",
            "expected before actual": "earlier expected date",
            "no expected date": "no expected date",
            "no actual date": "no actual date",
        }
    )
    return actual_labeled

def filter_and_relabel_expected(orders_matched: pd.DataFrame) -> pd.DataFrame:
    expected_labeled = orders_matched.loc[
        ~orders_matched["expecteddate"].isna(),
        ["country", "product", "expecteddate", "qty", "label"],
    ].copy()
    expected_labeled["label"] = expected_labeled["label"].map(
        {
            "actual before expected": "earlier actual date",
            "same date": "same actual date",
            "expected before actual": "later actual date",
            "no actual date": "no actual date",
            "no expected date": "no expected date",
        }
    )
    return expected_labeled
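
As a side note, the labeling step can also be written as a single pass with `np.select`. The sketch below is an alternative, not the answer's implementation; the function name `label_with_select` and the small `demo` frame are made up for illustration, and it assumes the date columns can be parsed by `pd.to_datetime`.

```python
import numpy as np
import pandas as pd


def label_with_select(orders_matched: pd.DataFrame) -> pd.Series:
    """Hypothetical alternative: pick one label per row with np.select."""
    actual = pd.to_datetime(orders_matched["actualdate"])
    expected = pd.to_datetime(orders_matched["expecteddate"])
    # The first condition that is True for a row wins.
    # Comparisons against a missing date (NaT) are always False.
    conditions = [
        actual.isna().to_numpy(),
        expected.isna().to_numpy(),
        (actual < expected).to_numpy(),
        (actual == expected).to_numpy(),
        (actual > expected).to_numpy(),
    ]
    choices = [
        "no actual date",
        "no expected date",
        "actual before expected",
        "same date",
        "expected before actual",
    ]
    labels = np.select(conditions, choices, default="")
    return pd.Series(labels, index=orders_matched.index, name="label")


# Illustrative data only, one row per possible label.
demo = pd.DataFrame(
    {
        "actualdate": ["2022-01-05", "2022-01-08", None, "2022-01-04", "2022-01-03"],
        "expecteddate": ["2022-01-05", "2022-01-05", "2022-01-07", None, "2022-01-05"],
    }
)
print(label_with_select(demo).tolist())
```

The label strings match those used by label_matched_orders, so the relabeling functions above would accept this output unchanged.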
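Explanation
The code is straightforward apart from the order matching and splitting part, which is a bit tricky.
Let's use this example: within a single (country, product) group, after sorting by date, expected_orders has quantities [100, 300] and actual_orders has quantities [300, 100, 200]. We can draw it like this:
Expected qty:    |-100-|-------300-------|
Actual qty:      |-------300-------|-100-|----200----|
(Each order quantity is drawn as a segment whose length equals the quantity. The segments are placed end to end on a line, preserving their order.)
Now draw a vertical line through the end point of every segment, cutting the segments into parts of equal length on both sides:
Vertical lines:  |-----|-----------|-----|-----------|
Expected split:  |-100-|----200----|-100-|           .
Actual split:    |-100-|----200----|-100-|----200----|
Using this picture, we can split the orders into parts of matching quantity: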
- Expected: [100, 300] -> [100, 200, 100]
- Actual: [300, 100, 200] -> [100, 200, 100, 200]
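
The split above can be reproduced for a single group with a few NumPy calls. This is a minimal sketch that assumes both quantity lists are non-empty and strictly positive; `split_quantities` is a name invented here, not part of the answer's code.

```python
import numpy as np


def split_quantities(expected_qty, actual_qty):
    """Split two quantity sequences at the union of their cumulative end points."""
    expected_ends = np.cumsum(expected_qty)
    actual_ends = np.cumsum(actual_qty)
    # Every end point of either sequence is one "vertical line".
    cut_points = np.union1d(expected_ends, actual_ends)
    # Segment lengths are the gaps between consecutive lines, starting from 0.
    segments = np.diff(cut_points, prepend=0)
    # Each side only covers the segments that end at or before its own total.
    expected_split = segments[cut_points <= expected_ends[-1]]
    actual_split = segments[cut_points <= actual_ends[-1]]
    return expected_split.tolist(), actual_split.tolist()


expected_split, actual_split = split_quantities([100, 300], [300, 100, 200])
print(expected_split)  # [100, 200, 100]
print(actual_split)    # [100, 200, 100, 200]
```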
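The match_orders() function implements this logic:
- Find the vertical line positions for actual and expected orders separately by computing the cumulative sum of quantities within each group.
- Combine the actual and expected line positions by merging the two DataFrames on the group keys and the cumulative quantity.
- Compute the segment lengths as the differences between consecutive cumulative sums.
- Fill in actualdate and expecteddate for each order part.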
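
The last step, attaching a date to every part, can be illustrated for one group with `np.searchsorted`: each part belongs to the first order whose cumulative end point is at or beyond the part's end. This is a sketch of the same idea as the backfill in match_orders(), not the answer's code; `owner_dates` is a hypothetical helper, and the quantities and dates are taken from the single-group example in this answer.

```python
import numpy as np


def owner_dates(order_ends, order_dates, cut_points):
    """Date of the order covering the part that ends at each cut point.

    Returns None for parts that lie beyond the last order on this side.
    """
    # Index of the first order whose cumulative end is >= the cut point.
    idx = np.searchsorted(order_ends, cut_points, side="left")
    return [order_dates[i] if i < len(order_dates) else None for i in idx]


expected_ends = np.cumsum([300, 500])
actual_ends = np.cumsum([100, 800, 200])
cut_points = np.union1d(expected_ends, actual_ends)

expected_dates = owner_dates(expected_ends, ["2022-01-05", "2022-01-07"], cut_points)
actual_dates = owner_dates(
    actual_ends, ["2022-01-05", "2022-01-08", "2022-01-09"], cut_points
)
print(expected_dates)
print(actual_dates)
```

Parts whose date comes back as None are the ones that end up labeled "no expected date" or "no actual date".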
Single-group example
Input:
expected = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "expecteddate": ["2022-01-05", "2022-01-07"],
        "qty": [300, 500],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-09"],
        "qty": [100, 800, 200],
    }
)
print(actual)
expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)
Output:
  country product expecteddate  qty
0      US     Pen   2022-01-05  300
1      US     Pen   2022-01-07  500
  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  800
2      US     Pen  2022-01-09  200
  country product expecteddate  qty              label
0      US     Pen   2022-01-05  100   same actual date
1      US     Pen   2022-01-05  200  later actual date
2      US     Pen   2022-01-07  500  later actual date
  country product  actualdate  qty                  label
0      US     Pen  2022-01-05  100     same expected date
1      US     Pen  2022-01-08  200  earlier expected date
2      US     Pen  2022-01-08  500  earlier expected date
3      US     Pen  2022-01-08  100       no expected date
4      US     Pen  2022-01-09  200       no expected date
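
One way to sanity-check a result like this is to confirm that splitting preserved the quantity of every original order. The sketch below rebuilds the actual-side input and output from the single-group example by hand; `qty_is_conserved` is a hypothetical helper, and grouping by (country, product, date) assumes that combination identifies the totals you care about.

```python
import pandas as pd


def qty_is_conserved(original, labeled, keys):
    """Return True if the per-key quantity totals are unchanged by splitting."""
    before = original.groupby(keys)["qty"].sum()
    after = labeled.groupby(keys)["qty"].sum()
    # Same keys on both sides and identical totals for every key.
    return before.index.equals(after.index) and bool((before == after).all())


keys = ["country", "product", "actualdate"]
actual = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-09"],
        "qty": [100, 800, 200],
    }
)
actual_labeled = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "actualdate": ["2022-01-05"] + ["2022-01-08"] * 3 + ["2022-01-09"],
        "qty": [100, 200, 500, 100, 200],
    }
)
print(qty_is_conserved(actual, actual_labeled, keys))  # True
```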
Multi-group example
Input:
expected = pd.DataFrame(
    {
        "country": ["US"] * 2 + ["Germany"] + ["Japan"] * 5,
        "product": ["Pen"] * 2 + ["Paper"] + ["Crayon"] * 5,
        "expecteddate": ["2022-01-05", "2022-01-07"]
        + ["2021-12-31"]
        + ["2022-03-15", "2022-03-16", "2022-03-16", "2022-03-17", "2022-03-17"],
        "qty": [100, 100, 2000, 100, 50, 150, 250, 50],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": ["US"] * 3 + ["Japan"] * 4,
        "product": ["Pen"] * 3 + ["Crayon"] * 4,
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-08"]
        + ["2022-03-15", "2022-03-15", "2022-03-19", "2022-03-19"],
        "qty": [100, 100, 100, 100, 50, 150, 250],
    }
)
print(actual)
expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)
Output:
   country product expecteddate   qty
0       US     Pen   2022-01-05   100
1       US     Pen   2022-01-07   100
2  Germany   Paper   2021-12-31  2000
3    Japan  Crayon   2022-03-15   100
4    Japan  Crayon   2022-03-16    50
5    Japan  Crayon   2022-03-16   150
6    Japan  Crayon   2022-03-17   250
7    Japan  Crayon   2022-03-17    50
  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  100
2      US     Pen  2022-01-08  100
3   Japan  Crayon  2022-03-15  100
4   Japan  Crayon  2022-03-15   50
5   Japan  Crayon  2022-03-19  150
6   Japan  Crayon  2022-03-19  250
   country product expecteddate   qty                label
0  Germany   Paper   2021-12-31  2000       no actual date
1    Japan  Crayon   2022-03-15   100     same actual date
2    Japan  Crayon   2022-03-16    50  earlier actual date
3    Japan  Crayon   2022-03-16   150    later actual date
4    Japan  Crayon   2022-03-17   250    later actual date
5    Japan  Crayon   2022-03-17    50       no actual date
6       US     Pen   2022-01-05   100     same actual date
7       US     Pen   2022-01-07   100    later actual date
  country product  actualdate  qty                  label
1   Japan  Crayon  2022-03-15  100     same expected date
2   Japan  Crayon  2022-03-15   50    later expected date
3   Japan  Crayon  2022-03-19  150  earlier expected date
4   Japan  Crayon  2022-03-19  250  earlier expected date
6      US     Pen  2022-01-05  100     same expected date
7      US     Pen  2022-01-08  100  earlier expected date
8      US     Pen  2022-01-08  100       no expected date
This topic (python: comparing two DataFrames, assigning labels, and splitting rows in Pandas/PySpark) is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/74619139/