python - Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark

Tags: python pandas numpy pyspark apache-spark-sql

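I have 2 dataframes containing expected order and actual order details.

Input data:

[image: input]

I want to create a label field in both dataframes and split the rows based on the following criteria:

  • Sort by country, product and date
  • Group both dataframes by country and product
  • In both dataframes, for each group, if a row's date and qty match, assign the label "same actual date" / "same expected date"
  • If qty matches but the dates differ, assign the labels ("earlier expected date" / "later expected date") and ("earlier actual date" / "later actual date")
  • If qty is not an exact match but there is remaining qty in the other dataframe for that group, split the row of the dataframe with the larger qty into 2 rows: the matching (smaller) qty and the remainder
  • Repeat these steps until all rows have labels
  • If there is no remaining qty in the other dataframe for the group, assign the label "no actual date" or "no expected date"

Expected output:

[image: expected output]

I am trying to do this with nested loops, but with millions of rows this is very slow.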

for key, exp in expected_grouped:
    act = actual_grouped.get_group(key)
    ...
    for i, outerrow in enumerate(exp.itertuples()):
        for j, innerrow in enumerate(act.itertuples()):
            if ...:  # condition elided in the question
                ...
            elif ...:
                ...

Is there a better and faster way to do this? Any suggestions for improvement would be greatly appreciated.

Best answer

You can use the label_orders function below:

import pandas as pd
import numpy as np
from typing import Tuple


def label_orders(
    expected_orders: pd.DataFrame,
    actual_orders: pd.DataFrame,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    if not (actual_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")
    if not (expected_orders["qty"] > 0).all():
        raise AssertionError("qty must be positive")

    orders_matched = match_orders(expected_orders, actual_orders)
    del expected_orders, actual_orders

    orders_matched["label"] = label_matched_orders(orders_matched)
    expected_labeled = filter_and_relabel_expected(orders_matched)
    actual_labeled = filter_and_relabel_actual(orders_matched)
    return expected_labeled, actual_labeled


def match_orders(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    group_keys = ["country", "product"]
    # Sort into new frames so the caller's dataframes are left untouched.
    expected = expected.sort_values(by=group_keys + ["expecteddate"])
    actual = actual.sort_values(by=group_keys + ["actualdate"])

    # Running total of qty within each group: the position where each order
    # ends on the quantity axis.
    expected["cumulative_qty"] = expected.groupby(group_keys)["qty"].cumsum()
    expected = expected.drop(columns="qty")
    actual["cumulative_qty"] = actual.groupby(group_keys)["qty"].cumsum()
    actual = actual.drop(columns="qty")

    # Union of the end positions from both frames.
    orders_matched = pd.merge_ordered(
        actual,
        expected,
        on=group_keys + ["cumulative_qty"],
        how="outer",
    )
    del expected, actual

    orders_matched.sort_values(by=group_keys + ["cumulative_qty"], inplace=True)
    # Part length = difference between consecutive end positions; the first
    # part of a group starts at 0, so its length equals its end position.
    previous_end = orders_matched.groupby(group_keys)["cumulative_qty"].shift(1)
    orders_matched["qty"] = (
        orders_matched["cumulative_qty"] - previous_end.fillna(0)
    ).astype(int)
    orders_matched.drop(columns="cumulative_qty", inplace=True)

    # Each part belongs to the next order (within its group) that ends at or
    # after it, so missing dates are backfilled per group.
    orders_matched["actualdate"] = orders_matched.groupby(group_keys)[
        "actualdate"
    ].bfill()
    orders_matched["expecteddate"] = orders_matched.groupby(group_keys)[
        "expecteddate"
    ].bfill()

    return orders_matched


def label_matched_orders(orders_matched: pd.DataFrame) -> pd.Series:
    labels = pd.Series(index=orders_matched.index, name="label", data="")
    labels.loc[orders_matched["actualdate"].isna()] = "no actual date"
    labels.loc[orders_matched["expecteddate"].isna()] = "no expected date"

    both_dates_present_mask = orders_matched["actualdate"].notna() & (
        orders_matched["expecteddate"].notna()
    )
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] < orders_matched["expecteddate"])
    ] = "actual before expected"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] == orders_matched["expecteddate"])
    ] = "same date"
    labels.loc[
        both_dates_present_mask
        & (orders_matched["actualdate"] > orders_matched["expecteddate"])
    ] = "expected before actual"

    return labels


def filter_and_relabel_actual(orders_matched: pd.DataFrame) -> pd.DataFrame:
    actual_labeled = orders_matched.loc[
        ~orders_matched["actualdate"].isna(),
        ["country", "product", "actualdate", "qty", "label"],
    ].copy()
    actual_labeled["label"] = actual_labeled["label"].map(
        {
            "actual before expected": "later expected date",
            "same date": "same expected date",
            "expected before actual": "earlier expected date",
            "no expected date": "no expected date",
            "no actual date": "no actual date",
        }
    )
    return actual_labeled


def filter_and_relabel_expected(orders_matched: pd.DataFrame) -> pd.DataFrame:
    expected_labeled = orders_matched.loc[
        ~orders_matched["expecteddate"].isna(),
        ["country", "product", "expecteddate", "qty", "label"],
    ].copy()
    expected_labeled["label"] = expected_labeled["label"].map(
        {
            "actual before expected": "earlier actual date",
            "same date": "same actual date",
            "expected before actual": "later actual date",
            "no actual date": "no actual date",
            "no expected date": "no expected date",
        }
    )
    return expected_labeled

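Explanation

Apart from the order matching and splitting part, the code is straightforward. The matching and splitting part is a bit tricky.

Let's use this example: in a single (country, product) group, after sorting by date, there are expected_orders with quantities [100, 300] and actual_orders with quantities [300, 100, 200]. We can draw it like this: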

Expected qty:   |-100-|-------300-------|
Actual qty:     |-------300-------|-100-|----200----|

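(Each order quantity is drawn as a segment whose length equals the quantity. The segments are placed side by side on a line, preserving their order.)

Let's draw vertical lines through the endpoints of every segment, splitting the segments into parts of equal length: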

Vertical lines: |-----|-----------|-----|-----------|
Expected split: |-100-|----200----|-100-|           .
Actual split:   |-100-|----200----|-100-|----200----|

Using this picture, we can split the orders into parts with equal quantities:

  • Expected [100, 300] -> [100, 200, 100]
  • Actual [300, 100, 200] -> [100, 200, 100, 200]
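
As a rough illustration of the picture above (not part of the answer's code), the vertical lines are the cumulative sums of both quantity lists, and the parts are the gaps between consecutive lines. The helper name `split_at_cuts` is made up for this sketch.

```python
import numpy as np

expected_qty = [100, 300]
actual_qty = [300, 100, 200]

# End position of every order on the shared quantity axis.
expected_ends = np.cumsum(expected_qty)
actual_ends = np.cumsum(actual_qty)

# The "vertical lines": every position where an order from either list ends.
cuts = np.union1d(expected_ends, actual_ends)


def split_at_cuts(cuts, total):
    """Hypothetical helper: part lengths for a list whose quantities sum to `total`."""
    own_cuts = cuts[cuts <= total]
    # Gap between consecutive cut positions, with the axis starting at 0.
    return np.diff(own_cuts, prepend=0).tolist()


expected_split = split_at_cuts(cuts, expected_ends[-1])
actual_split = split_at_cuts(cuts, actual_ends[-1])
print(expected_split)  # [100, 200, 100]
print(actual_split)    # [100, 200, 100, 200]
```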

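This logic is implemented in the match_orders() function:

  • Find the vertical line positions for actual and expected orders separately, by computing the cumulative sum of quantities within each group.
  • Combine the actual and expected line positions by merging the two dataframes on the group keys and the cumulative quantity.
  • Compute the segment lengths as the differences between consecutive cumulative sums.
  • Fill in the actual date and expected date for each order part.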
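
The four steps can be tried on a single group without the group keys. This is a simplified sketch, not the answer's function: the dates are invented for illustration, and the column name `end` and the use of plain `pd.merge` are assumptions of this example.

```python
import pandas as pd

# Hypothetical single-group data; quantities follow the diagram, dates are invented.
expected = pd.DataFrame(
    {"expecteddate": ["2022-01-01", "2022-01-02"], "qty": [100, 300]}
)
actual = pd.DataFrame(
    {"actualdate": ["2022-01-01", "2022-01-03", "2022-01-04"], "qty": [300, 100, 200]}
)

# 1. End position of every order on the quantity axis.
expected["end"] = expected["qty"].cumsum()
actual["end"] = actual["qty"].cumsum()

# 2. Union of the end positions from both frames.
parts = pd.merge(
    expected.drop(columns="qty"),
    actual.drop(columns="qty"),
    on="end",
    how="outer",
).sort_values("end", ignore_index=True)

# 3. Part length = distance to the previous end position (the first part starts at 0).
parts["qty"] = parts["end"].diff().fillna(parts["end"]).astype(int)

# 4. Each part belongs to the next order that ends at or after it.
date_cols = ["expecteddate", "actualdate"]
parts[date_cols] = parts[date_cols].bfill()

print(parts)
```

The last part keeps a missing expecteddate because no expected order covers it; in the answer's code such rows receive the "no expected date" label.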

Single-group example

Input:

expected = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "expecteddate": ["2022-01-05", "2022-01-07"],
        "qty": [300, 500],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": "US",
        "product": "Pen",
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-09"],
        "qty": [100, 800, 200],
    }
)
print(actual)
expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)

Output:

  country product expecteddate  qty
0      US     Pen   2022-01-05  300
1      US     Pen   2022-01-07  500

  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  800
2      US     Pen  2022-01-09  200

  country product expecteddate  qty              label
0      US     Pen   2022-01-05  100   same actual date
1      US     Pen   2022-01-05  200  later actual date
2      US     Pen   2022-01-07  500  later actual date

  country product  actualdate  qty                  label
0      US     Pen  2022-01-05  100     same expected date
1      US     Pen  2022-01-08  200  earlier expected date
2      US     Pen  2022-01-08  500  earlier expected date
3      US     Pen  2022-01-08  100       no expected date
4      US     Pen  2022-01-09  200       no expected date

Multi-group example

Input:

expected = pd.DataFrame(
    {
        "country": ["US"] * 2 + ["Germany"] + ["Japan"] * 5,
        "product": ["Pen"] * 2 + ["Paper"] + ["Crayon"] * 5,
        "expecteddate": ["2022-01-05", "2022-01-07"]
        + ["2021-12-31"]
        + ["2022-03-15", "2022-03-16", "2022-03-16", "2022-03-17", "2022-03-17"],
        "qty": [100, 100, 2000, 100, 50, 150, 250, 50],
    }
)
print(expected)
actual = pd.DataFrame(
    {
        "country": ["US"] * 3 + ["Japan"] * 4,
        "product": ["Pen"] * 3 + ["Crayon"] * 4,
        "actualdate": ["2022-01-05", "2022-01-08", "2022-01-08"]
        + ["2022-03-15", "2022-03-15", "2022-03-19", "2022-03-19"],
        "qty": [100, 100, 100, 100, 50, 150, 250],
    }
)
print(actual)

expected_labeled, actual_labeled = label_orders(
    expected_orders=expected, actual_orders=actual
)
print(expected_labeled)
print(actual_labeled)

Output:

   country product expecteddate   qty
0       US     Pen   2022-01-05   100
1       US     Pen   2022-01-07   100
2  Germany   Paper   2021-12-31  2000
3    Japan  Crayon   2022-03-15   100
4    Japan  Crayon   2022-03-16    50
5    Japan  Crayon   2022-03-16   150
6    Japan  Crayon   2022-03-17   250
7    Japan  Crayon   2022-03-17    50

  country product  actualdate  qty
0      US     Pen  2022-01-05  100
1      US     Pen  2022-01-08  100
2      US     Pen  2022-01-08  100
3   Japan  Crayon  2022-03-15  100
4   Japan  Crayon  2022-03-15   50
5   Japan  Crayon  2022-03-19  150
6   Japan  Crayon  2022-03-19  250

   country product expecteddate   qty                label
0  Germany   Paper   2021-12-31  2000       no actual date
1    Japan  Crayon   2022-03-15   100     same actual date
2    Japan  Crayon   2022-03-16    50  earlier actual date
3    Japan  Crayon   2022-03-16   150    later actual date
4    Japan  Crayon   2022-03-17   250    later actual date
5    Japan  Crayon   2022-03-17    50       no actual date
6       US     Pen   2022-01-05   100     same actual date
7       US     Pen   2022-01-07   100    later actual date

  country product  actualdate  qty                  label
1   Japan  Crayon  2022-03-15  100     same expected date
2   Japan  Crayon  2022-03-15   50    later expected date
3   Japan  Crayon  2022-03-19  150  earlier expected date
4   Japan  Crayon  2022-03-19  250  earlier expected date
6      US     Pen  2022-01-05  100     same expected date
7      US     Pen  2022-01-08  100  earlier expected date
8      US     Pen  2022-01-08  100       no expected date

Source: "python - Compare 2 dataframes, assign labels and split rows in Pandas/Pyspark" on Stack Overflow: https://stackoverflow.com/questions/74619139/
