apache-spark - pyspark.sql.utils.AnalysisException : Column ambiguous but no duplicate column names

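When I join two DataFrames on the id column, I get an ambiguous-column exception, even though neither DataFrame has duplicate column names. What could cause this error?

The join operation, where a and input have already been processed by other functions: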

from pyspark.sql import functions as F

b = (
    input
    .where(F.col('st').like('%VALUE%'))
    .select('id', 'sii')
)
a.join(b, b['id'] == a['item'])

The DataFrames:

(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]

(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
   +- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]

The exception:

pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;

If I recreate the DataFrame with the same schema, I don't get any error:

b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])

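What can I look at to troubleshoot what happened to the original DataFrames that causes the ambiguous-column error?

Best answer

This error, together with the fact that the sii column carries the same expression ID in both plans (sii#24665L), indicates that a and b were both derived from the same source. In effect, that makes your join a self-join, which is exactly what the error message says. It also explains why b_clean works: it is built from scratch from the schema alone, so it shares no lineage with a. For self-joins, the recommended approach is to alias the DataFrames. Try this: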

a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
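The shared-source diagnosis can be checked mechanically by comparing the `name#id` tokens in the two plans. The snippet below is a minimal sketch in plain Python that parses the plan text copied from the question; the helper name `attribute_ids` and the regular expression are illustrative assumptions, not part of any Spark API.

```python
import re

# Plan text copied from the a.explain() and b.explain() output in the question.
PLAN_A = "*(1) Scan ExistingRDD[item#25280L,sii#24665L]"
PLAN_B = """*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
   +- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]"""

# Matches tokens such as "sii#24665L": a column name, '#', a numeric ID,
# and an optional trailing "L" type suffix.
ATTR = re.compile(r"(\w+)#(\d+)L?")


def attribute_ids(plan_text):
    """Hypothetical helper: return the (column name, expression ID) pairs in a plan string."""
    return {(name, int(expr_id)) for name, expr_id in ATTR.findall(plan_text)}


# Any pair present in both plans means the two DataFrames share lineage.
shared = attribute_ids(PLAN_A) & attribute_ids(PLAN_B)
print(shared)
```

A non-empty intersection (here, the sii column) suggests that Spark will treat the join as a self-join, even though the column names on each side look distinct.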

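Also, on some systems you may not be able to save the result, because the resulting DataFrame will have two sii columns. I recommend explicitly selecting only the columns you need. If you decide you need both duplicated columns, renaming them with alias can help. For example: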

df = (
    a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
    .select('item',
            'id',
            F.col('a.sii').alias('a_sii')
    )
)

Source: a similar question on Stack Overflow about apache-spark - pyspark.sql.utils.AnalysisException : Column ambiguous but no duplicate column names: https://stackoverflow.com/questions/71976272/
