azure - Found duplicate column in one of the json when running spark.read.json even though there are no duplicate columns

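I am running into a very strange error in PySpark and in Synapse data flows.

I am reading a JSON file with the query below and get a duplicate column error even though the file has no duplicate columns. I can read it with other tools, a JSON validator, and data flows, but not in PySpark.

The PySpark query is as follows: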

df = (
    spark.read.option("multiline", "true")
    .options(encoding="UTF-8")
    .load(
        "abfss://<Container>@<DIR>.dfs.core.windows.net/export28.json", format="json"
    )
)

Here is the stack trace I get:

AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey

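Best answer

This error means Spark found duplicate names in the top-level columns or in nested structures of the schema it inferred.

Here is the statement from the Apache Spark website: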

In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema in read if they detect duplicate names in top-level columns as well in nested structures. The datasources take into account the SQL config spark.sql.caseSensitive while detecting column name duplicates.
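
Because `spark.sql.caseSensitive` is false by default, keys that differ only in letter case count as duplicates for Spark, while a JSON validator treats them as distinct. The following is a minimal sketch for checking a file for such keys outside Spark, using only the Python standard library. The function name `find_case_collisions` and the sample document are hypothetical, and the sketch assumes that a case difference is the cause here, which the question does not confirm.

```python
import json
from collections import defaultdict


def find_case_collisions(text):
    """Map each lowercased key path to the spellings seen, keeping only
    paths where more than one spelling occurs."""
    spellings = defaultdict(set)  # lowercased path -> original spellings

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                child = path + (key.lower(),)
                spellings[child].add(key)
                walk(value, child)
        elif isinstance(node, list):
            # Array elements share one schema, so they keep the same path.
            for item in node:
                walk(item, path)

    walk(json.loads(text), ())
    return {".".join(p): sorted(s) for p, s in spellings.items() if len(s) > 1}


# Hypothetical sample: two records whose keys differ only by case.
sample = (
    '[{"AmendationCommentKey": 1, "nested": {"Id": 1}},'
    ' {"amendationcommentkey": 2, "nested": {"ID": 2}}]'
)
print(find_case_collisions(sample))
```

An empty result suggests the duplicates come from something other than letter case. Keys repeated with identical spelling inside one object are not reported, because `json.loads` keeps only the last of them.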

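Try the commands below. Everything depends on the schema, so take the schema from a file that reads cleanly and pass it explicitly when reading the problem file. This code worked in my case.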

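# schemaPath: a JSON file that Spark can read and that has the expected schema
schema_df = spark.read.json(schemaPath)
schema = schema_df.schema

# json_path: the file that raised the duplicate column error
df = spark.read.option("multiline", "true").schema(schema).json(json_path)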
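
If the clashing keys differ only by letter case, another option that follows from the Spark statement quoted above is to turn on case-sensitive name resolution before reading. This is a sketch under that assumption, not something the answer itself tested; the helper name `read_json_case_sensitive` is hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
def read_json_case_sensitive(spark, path):
    """Read a multiline JSON file with case-sensitive column names.

    Assumption: the duplicate names reported by Spark differ only by case.
    """
    # Session-wide setting: "Id" and "ID" become two distinct columns.
    spark.conf.set("spark.sql.caseSensitive", "true")
    return spark.read.option("multiline", "true").json(path)


# Usage, with json_path pointing at the problem file:
# df = read_json_case_sensitive(spark, json_path)
```

The setting stays active for the rest of the session, so later queries must use the exact casing of each column name.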

See also these Stack Overflow threads (SO1, SO2, SO3), where the authors give good explanations for different scenarios.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/70115293/
