python - PySpark pandas UDF runtime error: Number of columns of the returned doesn't match specified schema

I defined the pandas UDF below:

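import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Output schema: one row per sensor with a confidence value
schema2 = StructType([
    StructField('sensorid', IntegerType(), True),
    StructField('confidence', DoubleType(), True)])

@pandas_udf(schema2, PandasUDFType.GROUPED_MAP)
def PreProcess(Indf):
    confidence = 1
    sensor = Indf.iloc[0, 0]  # sensorid of the first row in the group
    df = pd.DataFrame(columns=['sensorid', 'confidence'])
    df['sensorid'] = [sensor]
    df['confidence'] = [0]
    return df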

Then I pass a Spark DataFrame with 3 columns to that UDF:

results.groupby("sensorid").apply(PreProcess)

results:
+--------+---------------+---------------+
|sensorid|sensortimestamp|calculatedvalue|
+--------+---------------+---------------+
|  397332|     1596518086|          -39.0|
|  397332|     1596525586|          -31.0|
+--------+---------------+---------------+

But I keep getting this error:

RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema.Expected: 3 Actual: 4

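I can tell what the error is trying to say, but I don't understand how it comes up. I thought I was returning the correct 2 columns, as specified in the schema for the returned DataFrame.

Best answer

apply is deprecated, and it seems to expect the function to return the same columns as the input, 3 in this case. Try applyInPandas with the expected output schema instead: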

results.groupby("sensorid").applyInPandas(PreProcess, schema=schema2)

Updated to point at the latest version of the documentation (Spark's docs changed and the old links broke).
In version 3.0.0: apply, applyInPandas
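
For reference, here is a minimal sketch of how the applyInPandas version could look end to end. As far as I can tell from the Spark 3 documentation, applyInPandas takes a plain Python function, so the @pandas_udf decorator is dropped here and the schema is passed at the call site. The names pre_process and run_on_spark, the local[1] master, and the DDL-string form of the schema are my own choices for illustration and are not from the question. The pyspark import is kept inside run_on_spark so the function can first be checked with pandas alone.

```python
import pandas as pd


def pre_process(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return one row per group: the group's sensorid and a placeholder confidence."""
    sensor = int(pdf["sensorid"].iloc[0])
    return pd.DataFrame({"sensorid": [sensor], "confidence": [0.0]})


def run_on_spark():
    # Hypothetical driver code; pyspark is imported here so that the
    # function above can be tried with pandas alone.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    results = spark.createDataFrame(
        [(397332, 1596518086, -39.0), (397332, 1596525586, -31.0)],
        ["sensorid", "sensortimestamp", "calculatedvalue"],
    )
    # The output schema is passed to applyInPandas, not to a decorator.
    out = results.groupby("sensorid").applyInPandas(
        pre_process, schema="sensorid int, confidence double"
    )
    out.show()
    spark.stop()


# Local check without Spark: call the function on a pandas frame directly.
sample = pd.DataFrame(
    {
        "sensorid": [397332, 397332],
        "sensortimestamp": [1596518086, 1596525586],
        "calculatedvalue": [-39.0, -31.0],
    }
)
local = pre_process(sample)
print(list(local.columns))  # these names must match the schema given to Spark
```

Calling the function on a small pandas frame first makes it easy to see whether the returned columns really match the schema before Spark gets involved; run_on_spark() can then be called on a machine with pyspark installed.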

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/63403001/
