python - How do I do TDD with pandas and pytest?

Tags: python python-3.x pandas tdd pytest

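I have a Python script that consolidates reports, using Pandas throughout in a chain of DataFrame operations (drop, groupby, sum, and so on). Say I start with a simple function that removes every column that has no values. It takes a DataFrame as input and returns a DataFrame: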

# cei.py
import pandas as pd

def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Implementation goes here; for example, drop every column
    # in which all values are missing:
    return source_df.dropna(axis="columns", how="all")

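In my test I want to verify that this function really removes every column whose values are all missing. So I arrange a test input and an expected output, and compare them with the assert_frame_equal function from pandas.testing: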

# test_cei.py
import pandas as pd

import cei  # the module under test

def test_clean_table_cols() -> None:
    df = pd.DataFrame(
        {
            "full_valued": [1, 2, 3],
            "all_missing1": [None, None, None],
            "some_missing": [None, 2, 3],
            "all_missing2": [None, None, None],
        }
    )
    expected = pd.DataFrame({"full_valued": [1, 2, 3], "some_missing": [None, 2, 3]})
    result = cei.clean_table_cols(df)
    pd.testing.assert_frame_equal(result, expected)

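My question is whether this is conceptually a unit test or an e2e/integration test, since I am not mocking the pandas implementation. But if I mock the DataFrame, I am no longer testing what the code actually does. What is the recommended way to test this while following TDD best practices?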

Note: using Pandas in this project is a design decision, so we have no intention of abstracting the Pandas interface in order to swap in another library later.

Best answer

You may find tdda (test-driven data analysis) useful. Quoting from the documentation:

The tdda package provides Python support for test-driven data analysis (see 1-page summary with references, or the blog). The tdda.referencetest library is used to support the creation of reference tests, based on either unittest or pytest. The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relation databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records. The tdda.rexpy library is a tool for automatically inferring regular expressions from a column in a Pandas DataFrame or from a (Python) list of examples. There is also a command-line utility for Rexpy. Although the library is provided as a Python package, and can be called through its Python API, it also provides command-line tools.

See also Nick Radcliffe's PyData talk on Test-Driven Data Analysis.
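Separately from tdda, the function in the question can also be exercised against small in-memory frames, with pandas left unmocked. The following is only a sketch: it assumes the dropna one-liner that the question offers as an example implementation, and the three test names and edge cases are hypothetical additions, not part of the original post.

```python
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Assumed implementation: the one-liner suggested in the question.
    return source_df.dropna(axis="columns", how="all")


def test_keeps_frame_without_empty_columns() -> None:
    # No column is entirely missing, so nothing should be dropped.
    df = pd.DataFrame({"a": [1, 2], "b": [None, 3.0]})
    pd.testing.assert_frame_equal(clean_table_cols(df), df)


def test_does_not_mutate_input() -> None:
    # The caller's frame should be left untouched.
    df = pd.DataFrame({"a": [1, 2], "empty": [None, None]})
    before = df.copy()
    clean_table_cols(df)
    pd.testing.assert_frame_equal(df, before)


def test_drops_all_columns_when_everything_is_missing() -> None:
    df = pd.DataFrame({"x": [None, None], "y": [None, None]})
    result = clean_table_cols(df)
    assert list(result.columns) == []
    assert len(result) == 2  # the row index is preserved
```

Each test builds its own tiny frame, so the tests stay fast and independent of any report data. pytest collects them by the test_ prefix with no further setup.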

This question, "python - How do I do TDD with pandas and pytest?", corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/61291416/
