I'm running into a very strange error in PySpark and Synapse data flows.
I'm reading a JSON file with the query below, but I get a duplicate-column error even though there are no duplicate columns. I can read the file with other tools, JSON validators, and data flows, but not with PySpark.
The PySpark query:
df = (
    spark.read.option("multiline", "true")
    .options(encoding="UTF-8")
    .load(
        "abfss://<Container>@<DIR>.dfs.core.windows.net/export28.json", format="json"
    )
)
Here is the stack trace I get:
AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey
Best answer
This error means Spark detected duplicate names, either among the top-level columns or inside nested structures.
Here is the relevant statement from the Apache Spark website:
In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema in read if they detect duplicate names in top-level columns as well in nested structures. The datasources take into account the SQL config spark.sql.caseSensitive while detecting column name duplicates.
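This explains why a file with no literally identical keys can still fail: with spark.sql.caseSensitive at its default of false, Spark compares field names case-insensitively, so keys such as amendationCommentKey and amendationcommentkey collide. A minimal pure-Python sketch of that check (the helper and sample keys are illustrative, not Spark's actual code):

```python
import json
from collections import Counter

def find_case_insensitive_duplicates(obj, path=""):
    """Recursively collect JSON object keys that collide when lower-cased,
    mimicking Spark's default case-insensitive duplicate detection."""
    dupes = []
    if isinstance(obj, dict):
        counts = Counter(k.lower() for k in obj)
        dupes += [f"{path}{name}" for name, n in counts.items() if n > 1]
        for k, v in obj.items():
            dupes += find_case_insensitive_duplicates(v, f"{path}{k}.")
    elif isinstance(obj, list):
        for item in obj:
            dupes += find_case_insensitive_duplicates(item, path)
    return dupes

# Two keys differing only in case survive json.loads but collide for Spark
doc = json.loads(
    '{"amendationCommentKey": 1, "amendationcommentkey": 2, "nested": {"x": 1}}'
)
print(find_case_insensitive_duplicates(doc))  # ['amendationcommentkey']
```

Running such a scan over the offending file can pinpoint which fields trigger the error before involving Spark at all.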
Try something like the commands below. Everything depends on the schema, and this code resolved it in my case:
# Infer the schema from a reference JSON file, then apply it explicitly on read
Sch = spark.read.json(schemaPath)
schema = Sch.schema
df = spark.read.option("multiline", "true").schema(schema).json(f"{json_path}")
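If the colliding keys cannot be fixed at the producer, another workaround is to rename the duplicates before handing the file to Spark. This is a preprocessing sketch in plain Python (the helper is hypothetical, not part of PySpark): later case-insensitive duplicates get a numeric suffix so every field name is unique.

```python
import json

def rename_duplicate_keys(obj):
    """Return a copy of a parsed JSON value in which keys that collide
    case-insensitively are suffixed (_2, _3, ...) so Spark can read them."""
    if isinstance(obj, dict):
        seen = {}
        out = {}
        for k, v in obj.items():
            low = k.lower()
            seen[low] = seen.get(low, 0) + 1
            new_key = k if seen[low] == 1 else f"{k}_{seen[low]}"
            out[new_key] = rename_duplicate_keys(v)
        return out
    if isinstance(obj, list):
        return [rename_duplicate_keys(item) for item in obj]
    return obj

doc = json.loads('{"Key": 1, "key": 2}')
print(rename_duplicate_keys(doc))  # {'Key': 1, 'key_2': 2}
```

The rewritten document can then be saved back to storage and read with spark.read.json as usual; whether renaming is acceptable depends on what downstream consumers expect the field names to be.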
Regarding "azure - Found duplicate column in one of the JSON files when running spark.read.json, even though there are no duplicate columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70115293/