python - Compare 2 dataframes in Pandas/PySpark, assign labels, and split rows

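I have 2 dataframes containing expected order details and actual order details.

Input data:

I want to create a label field in both dataframes and split rows according to the following conditions:

Sort by country, product, and date
Group the dataframes by country and product
In both dataframes, for each group, if a row's date and quantity match, assign the label same actual date / same expected date
If the quantity matches but the dates differ, assign the labels (earlier expected date / later expected date) and (earlier actual date / later actual date)
If the quantities do not match exactly, but the other dataframe still has remaining quantity for that group, split the row from the dataframe with the larger quantity into 2 rows: the matched (smaller) quantity and the remainder
Repeat these steps until all rows have a label
If the other dataframe has no remaining quantity for the group, assign the label no actual date or no expected date

Expected output:

I am trying to do this with nested loops, but for millions of rows this is very slow.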

for key, exp in expected_grouped:
  act = actual_grouped.get_group(key)
  ...
  for i, outerrow in enumerate(exp.itertuples()):
    for j, innerrow in enumerate(act.itertuples()):
      if: ...
      elif: ...

Is there a better, faster way to do this? Any suggestions for improvement would be appreciated.

Best answer

You can use the label_orders function below:

import pandas as pd
import numpy as np
from typing import Tuple


def label_orders(
    expected_orders: pd.DataFrame,
    actual_orders: pd.DataFrame,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    if not (actual_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")
    if not (expected_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")

    orders_matched = match_orders(expected_orders, actual_orders)
    del expected_orders, actual_orders

    orders_matched["label"] = label_matched_orders(orders_matched)
    expected_labeled = filter_and_relabel_expected(orders_matched)
    actual_labeled = filter_and_relabel_actual(orders_matched)
    return expected_labeled, actual_labeled


def match_orders(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    expected.sort_values(by=["country", "product", "expecteddate"], inplace=True)
    actual.sort_values(by=["country", "product", "actualdate"], inplace=True)

    # Running total of qty within each (country, product) group gives the
    # right endpoint of every order on the group's quantity line.
    expected["cumulative_qty"] = expected.groupby(["country", "product"])[
        "qty"
    ].cumsum()
    expected.drop(columns="qty", inplace=True)
    actual["cumulative_qty"] = actual.groupby(["country", "product"])["qty"].cumsum()
    actual.drop(columns="qty", inplace=True)
    orders_matched = pd.merge_ordered(
        actual,
        expected,
        on=["country", "product", "cumulative_qty"],
        how="outer",
    )
    del expected, actual

    orders_matched.sort_values(
        by=["country", "product", "cumulative_qty"], inplace=True
    )
    orders_matched["qty"] = orders_matched["cumulative_qty"] - orders_matched.groupby(
        ["country", "product"], sort=True
    )["cumulative_qty"].shift(1)
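    # The first part of each group has no predecessor, so its length is the
    # cumulative quantity itself. Use .loc to avoid chained assignment.
    is_first_in_group = orders_matched["qty"].isna()
    orders_matched.loc[is_first_in_group, "qty"] = orders_matched.loc[
        is_first_in_group, "cumulative_qty"
    ]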
    orders_matched.drop(columns="cumulative_qty", inplace=True)
    orders_matched["qty"] = orders_matched["qty"].astype(int)

    # A part belongs to the next order (in cumulative order) on each side,
    # so fill missing dates backwards within each group.
    orders_matched["actualdate"] = orders_matched.groupby(
        ["country", "product"], sort=True
    )["actualdate"].bfill()
    orders_matched["expecteddate"] = orders_matched.groupby(
        ["country", "product"], sort=True
    )["expecteddate"].bfill()

    return orders_matched


def label_matched_orders(orders_matched: pd.DataFrame) -> pd.Series:
    labels = pd.Series(index=orders_matched.index, name="label", data="")
    labels.loc[orders_matched["actualdate"].isna()] = "no actual date"
    labels.loc[orders_matched["expecteddate"].isna()] = "no expected date"

    both_dates_present_mask = (~orders_matched["actualdate"].isna()) & (
        ~orders_matched["expecteddate"].isna()
    )
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] < orders_matched["expecteddate"])
    ] = "actual before expected"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] == orders_matched["expecteddate"])
    ] = "same date"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] > orders_matched["expecteddate"])
    ] = "expected before actual"

    return labels


def filter_and_relabel_actual(orders_matched: pd.DataFrame) -> pd.DataFrame:
    actual_labeled = orders_matched.loc[
        ~orders_matched["actualdate"].isna(),
        ["country", "product", "actualdate", "qty", "label"],
    ].copy()
    actual_labeled["label"] = actual_labeled["label"].map(
        {
            "actual before expected": "later expected date",
            "same date": "same expected date",
            "expected before actual": "earlier expected date",
            "no expected date": "no expected date",
            "no actual date": "no actual date",
        }
    )
    return actual_labeled


def filter_and_relabel_expected(orders_matched: pd.DataFrame) -> pd.DataFrame:
    expected_labeled = orders_matched.loc[
        ~orders_matched["expecteddate"].isna(),
        ["country", "product", "expecteddate", "qty", "label"],
    ].copy()
    expected_labeled["label"] = expected_labeled["label"].map(
        {
            "actual before expected": "earlier actual date",
            "same date": "same actual date",
            "expected before actual": "later actual date",
            "no actual date": "no actual date",
            "no expected date": "no expected date",
        }
    )
    return expected_labeled
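
The chain of masked assignments in label_matched_orders could also be written with numpy.select, which picks the first matching condition per row. The sketch below is only an illustration: the helper name label_with_select is hypothetical, and it assumes the date columns have been converted to datetime64 so that comparisons against missing values (NaT) evaluate to False.

```python
import numpy as np
import pandas as pd


def label_with_select(actualdate: pd.Series, expecteddate: pd.Series) -> pd.Series:
    # Conditions are checked in order; the first True one wins for each row.
    conditions = [
        expecteddate.isna().to_numpy(),
        actualdate.isna().to_numpy(),
        (actualdate < expecteddate).to_numpy(),
        (actualdate == expecteddate).to_numpy(),
        (actualdate > expecteddate).to_numpy(),
    ]
    choices = [
        "no expected date",
        "no actual date",
        "actual before expected",
        "same date",
        "expected before actual",
    ]
    labels = np.select(conditions, choices, default="")
    return pd.Series(labels, index=actualdate.index, name="label")


# Hypothetical dates chosen to hit every branch once.
actual_dates = pd.Series(
    pd.to_datetime(["2022-01-05", "2022-01-08", None, "2022-01-08", "2022-01-04"])
)
expected_dates = pd.Series(
    pd.to_datetime(["2022-01-05", "2022-01-05", "2022-01-07", None, "2022-01-06"])
)
labels = label_with_select(actual_dates, expected_dates)
print(labels.tolist())
```

The intermediate label names are the same as in label_matched_orders, so the two relabel functions above would work unchanged on this output.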

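Explanation

Apart from the order matching and splitting part, the code is straightforward. The matching and splitting part is a bit tricky.

Let's use this example: in a single (country, product) group, after sorting by date, there are expected_orders with quantities [100, 300] and actual_orders with quantities [300, 100, 200]. We can draw it like this: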

Expected qty:   |-100-|-------300-------|
Actual qty:     |-------300-------|-100-|----200----|

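(Each order quantity is drawn as a segment whose length equals the quantity. Segments are placed end to end on a line, preserving their order.)

Let's draw vertical lines through the endpoints of every segment, which cuts the segments into parts of equal length on both sides: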

Vertical lines: |-----|-----------|-----|-----------|
Expected split: |-100-|----200----|-100-|           .
Actual split:   |-100-|----200----|-100-|----200----|

Using this picture, we can split the orders into parts with equal quantities:

Expected [100, 300] -> [100, 200, 100]
Actual [300, 100, 200] -> [100, 200, 100, 200]
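
The same split can be reproduced for a single group with plain NumPy. This is only a sketch of the idea for the example quantities, not the pandas implementation used in the answer; the variable names are chosen for illustration.

```python
import numpy as np

expected_qty = np.array([100, 300])
actual_qty = np.array([300, 100, 200])

# Right endpoint of each order's segment on the shared line.
expected_ends = np.cumsum(expected_qty)  # 100, 400
actual_ends = np.cumsum(actual_qty)      # 300, 400, 600

# Every endpoint from either side is a cut position (a "vertical line").
# union1d returns the sorted, de-duplicated positions.
cuts = np.union1d(expected_ends, actual_ends)

# Part lengths are the gaps between consecutive cuts, starting from 0.
parts = np.diff(cuts, prepend=0)

# A part belongs to a side if it ends at or before that side's total quantity.
expected_parts = parts[cuts <= expected_ends[-1]]
actual_parts = parts[cuts <= actual_ends[-1]]

print(cuts.tolist())
print(expected_parts.tolist())
print(actual_parts.tolist())
```

The last actual part of 200 has no expected counterpart, which is what later becomes a "no expected date" row.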

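This logic is implemented in the match_orders() function:

Compute the positions of the vertical lines for actual and expected orders separately, as the cumulative sum of quantity within each group.
Combine the actual and expected line positions by merging the actual and expected dataframes on the group keys and the cumulative quantity.
Compute the segment lengths as the differences between consecutive cumulative sums.
Fill in the actual and expected dates for the order parts.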

def match_orders(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    expected.sort_values(by=["country", "product", "expecteddate"], inplace=True)
    actual.sort_values(by=["country", "product", "actualdate"], inplace=True)

    # Running total of qty within each (country, product) group gives the
    # right endpoint of every order on the group's quantity line.
    expected["cumulative_qty"] = expected.groupby(["country", "product"])[
        "qty"
    ].cumsum()
    expected.drop(columns="qty", inplace=True)
    actual["cumulative_qty"] = actual.groupby(["country", "product"])["qty"].cumsum()
    actual.drop(columns="qty", inplace=True)

    orders_matched = pd.merge_ordered(
        actual,
        expected,
        on=["country", "product", "cumulative_qty"],
        how="outer",
    )
    del expected, actual

    orders_matched.sort_values(
        by=["country", "product", "cumulative_qty"], inplace=True
    )
    orders_matched["qty"] = orders_matched["cumulative_qty"] - orders_matched.groupby(
        ["country", "product"], sort=True
    )["cumulative_qty"].shift(1)
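    # The first part of each group has no predecessor, so its length is the
    # cumulative quantity itself. Use .loc to avoid chained assignment.
    is_first_in_group = orders_matched["qty"].isna()
    orders_matched.loc[is_first_in_group, "qty"] = orders_matched.loc[
        is_first_in_group, "cumulative_qty"
    ]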
    orders_matched.drop(columns="cumulative_qty", inplace=True)
    orders_matched["qty"] = orders_matched["qty"].astype(int)

    # A part belongs to the next order (in cumulative order) on each side,
    # so fill missing dates backwards within each group.
    orders_matched["actualdate"] = orders_matched.groupby(
        ["country", "product"], sort=True
    )["actualdate"].bfill()
    orders_matched["expecteddate"] = orders_matched.groupby(
        ["country", "product"], sort=True
    )["expecteddate"].bfill()

    return orders_matched
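
To see what the merge and the backfill do, it may help to run them on one group in isolation. The sketch below uses the cumulative quantities from the [100, 300] versus [300, 100, 200] example; the dates are hypothetical, and the group keys are left out because there is only one group.

```python
import pandas as pd

# Cumulative quantities from the explanation; dates are made up for illustration.
expected = pd.DataFrame(
    {"cumulative_qty": [100, 400], "expecteddate": ["2022-01-01", "2022-01-02"]}
)
actual = pd.DataFrame(
    {
        "cumulative_qty": [300, 400, 600],
        "actualdate": ["2022-01-01", "2022-01-03", "2022-01-04"],
    }
)

# The outer merge keeps every cut position from either side. A row that exists
# on only one side gets a missing date for the other side.
merged = pd.merge_ordered(actual, expected, on="cumulative_qty", how="outer")

# Each part belongs to the next order ending at or after it, so backfill.
merged["actualdate"] = merged["actualdate"].bfill()
merged["expecteddate"] = merged["expecteddate"].bfill()

# Part length is the gap to the previous cut; the first part starts at 0.
merged["qty"] = (
    merged["cumulative_qty"].diff().fillna(merged["cumulative_qty"]).astype(int)
)
print(merged)
```

The final row keeps a missing expecteddate because no expected order extends that far, which is the case the labeling step turns into "no expected date".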

Single-group example

Input:

expected = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "expecteddate": ["2022-01-05", "2022-01-07"],
        "qty": [300, 500],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-09"],
        "qty": [100, 800, 200],
    }
)
print(actual)
expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)

Output:

  country product expecteddate  qty
0      US     Pen   2022-01-05  300
1      US     Pen   2022-01-07  500

  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  800
2      US     Pen  2022-01-09  200

  country product expecteddate  qty              label
0      US     Pen   2022-01-05  100   same actual date
1      US     Pen   2022-01-05  200  later actual date
2      US     Pen   2022-01-07  500  later actual date

  country product  actualdate  qty                  label
0      US     Pen  2022-01-05  100     same expected date
1      US     Pen  2022-01-08  200  earlier expected date
2      US     Pen  2022-01-08  500  earlier expected date
3      US     Pen  2022-01-08  100       no expected date
4      US     Pen  2022-01-09  200       no expected date
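
One way to sanity-check a result like this is to confirm that splitting rows did not change the total quantity per (country, product, date). The helper below is a hypothetical check, not part of the answer's code; the frames are copied from the tables printed above.

```python
import pandas as pd


def totals_match(original: pd.DataFrame, labeled: pd.DataFrame, date_col: str) -> bool:
    # Splitting a row keeps its keys and date, so per-key sums must be unchanged.
    keys = ["country", "product", date_col]
    before = original.groupby(keys)["qty"].sum()
    after = labeled.groupby(keys)["qty"].sum()
    return before.equals(after)


expected = pd.DataFrame(
    {"country": "US", "product": "Pen",
     "expecteddate": ["2022-01-05", "2022-01-07"], "qty": [300, 500]}
)
expected_labeled = pd.DataFrame(
    {"country": "US", "product": "Pen",
     "expecteddate": ["2022-01-05", "2022-01-05", "2022-01-07"],
     "qty": [100, 200, 500]}
)
actual = pd.DataFrame(
    {"country": "US", "product": "Pen",
     "actualdate": ["2022-01-05", "2022-01-08", "2022-01-09"],
     "qty": [100, 800, 200]}
)
actual_labeled = pd.DataFrame(
    {"country": "US", "product": "Pen",
     "actualdate": ["2022-01-05", "2022-01-08", "2022-01-08", "2022-01-08", "2022-01-09"],
     "qty": [100, 200, 500, 100, 200]}
)

expected_ok = totals_match(expected, expected_labeled, "expecteddate")
actual_ok = totals_match(actual, actual_labeled, "actualdate")
print(expected_ok, actual_ok)
```

Note that label_orders as written sorts its inputs in place and drops their qty column, so a check like this needs copies of the original frames taken before the call.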

Multi-group example:

Input:

expected = pd.DataFrame(
    {
        "country": ["US"] * 2 + ["Germany"] + ["Japan"] * 5,
        "product": ["Pen"] * 2 + ["Paper"] + ["Crayon"] * 5,
        "expecteddate": ["2022-01-05", "2022-01-07"]
        + ["2021-12-31"]
        + ["2022-03-15", "2022-03-16", "2022-03-16", "2022-03-17", "2022-03-17"],
        "qty": [100, 100, 2000, 100, 50, 150, 250, 50],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": ["US"] * 3 + ["Japan"] * 4,
        "product": ["Pen"] * 3 + ["Crayon"] * 4,
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-08"]
        + ["2022-03-15", "2022-03-15", "2022-03-19", "2022-03-19"],
        "qty": [100, 100, 100, 100, 50, 150, 250],
    }
)
print(actual)

expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)

Output:

   country product expecteddate   qty
0       US     Pen   2022-01-05   100
1       US     Pen   2022-01-07   100
2  Germany   Paper   2021-12-31  2000
3    Japan  Crayon   2022-03-15   100
4    Japan  Crayon   2022-03-16    50
5    Japan  Crayon   2022-03-16   150
6    Japan  Crayon   2022-03-17   250
7    Japan  Crayon   2022-03-17    50

  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  100
2      US     Pen  2022-01-08  100
3   Japan  Crayon  2022-03-15  100
4   Japan  Crayon  2022-03-15   50
5   Japan  Crayon  2022-03-19  150
6   Japan  Crayon  2022-03-19  250

   country product expecteddate   qty                label
0  Germany   Paper   2021-12-31  2000       no actual date
1    Japan  Crayon   2022-03-15   100     same actual date
2    Japan  Crayon   2022-03-16    50  earlier actual date
3    Japan  Crayon   2022-03-16   150    later actual date
4    Japan  Crayon   2022-03-17   250    later actual date
5    Japan  Crayon   2022-03-17    50       no actual date
6       US     Pen   2022-01-05   100     same actual date
7       US     Pen   2022-01-07   100    later actual date

  country product  actualdate  qty                  label
1   Japan  Crayon  2022-03-15  100     same expected date
2   Japan  Crayon  2022-03-15   50    later expected date
3   Japan  Crayon  2022-03-19  150  earlier expected date
4   Japan  Crayon  2022-03-19  250  earlier expected date
6      US     Pen  2022-01-05  100     same expected date
7      US     Pen  2022-01-08  100  earlier expected date
8      US     Pen  2022-01-08  100       no expected date

Regarding python - Compare 2 dataframes in Pandas/PySpark, assign labels, and split rows, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/74619139/
