python - Error converting a table to pandas with PySpark in Databricks

I'm using Databricks and want to convert my PySpark DataFrame to a pandas DataFrame with df.toPandas().

However, I keep getting this error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
  'DataFrame' object has no attribute 'dtype'
  warnings.warn(msg)
AttributeError: 'DataFrame' object has no attribute 'dtype'

I have tried several things, including:

spark.conf.set("spark.sql.execution.arrow.enabled", "false")

But nothing has worked so far. (I also checked several other posts about this problem, but none of them helped.)

Update: output of df.printSchema():

 |-- flight_id: string (nullable = true)
 |-- flight_direction: string (nullable = true)
 |-- service_type: string (nullable = true)
 |-- flight_designator: string (nullable = true)
 |-- flight_number: string (nullable = true)
 |-- callsign: string (nullable = true)
 |-- scheduled_datetime: timestamp (nullable = true)
 |-- connecting_flight_designator: string (nullable = true)
 |-- airport_iata_codes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- airline_name: string (nullable = true)
 |-- airport_names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country_number: long (nullable = true)
 |-- eu_category: string (nullable = true)
 |-- safe_town_indicator: boolean (nullable = true)
 |-- sibt: timestamp (nullable = true)
 |-- aibt: timestamp (nullable = true)
 |-- sobt: timestamp (nullable = true)
 |-- aibt: timestamp (nullable = true)
 |-- tsat: timestamp (nullable = true)
 |-- aircraft_name: string (nullable = true)
 |-- aircraft_registration: string (nullable = true)
 |-- ramp: string (nullable = true)
 |-- ramp_previous: string (nullable = true)
 |-- seats: long (nullable = true)
 |-- actual_total_pax: integer (nullable = true)
 |-- handler_apron: string (nullable = true)
 |-- occupancy_rate: double (nullable = false)

Best answer

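The problem was in my data filtering: it left the DataFrame with duplicate columns (in the schema above, aibt appears twice). If anyone runs into a similar problem in the future, check for this.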
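
For anyone wanting to check this quickly, here is a minimal sketch of one way to spot and rename duplicate column names before calling toPandas(). When a name is duplicated, selecting it in pandas returns a DataFrame rather than a Series, and a DataFrame has no dtype attribute, which is consistent with the error above. The helper names find_duplicate_columns and dedupe_column_names and the "_2" suffix scheme are my own assumptions for illustration, not part of PySpark; the only PySpark calls involved are df.columns, df.toDF() and df.toPandas().

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


def dedupe_column_names(columns):
    """Return a copy of `columns` where repeated names get a numeric suffix.

    The suffix scheme (aibt, aibt_2, ...) is an arbitrary choice for this sketch.
    """
    used = set()
    result = []
    for name in columns:
        candidate, n = name, 1
        # Keep bumping the suffix until the name is not already taken.
        while candidate in used:
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


# A few of the column names from the schema in the question.
cols = ["sibt", "aibt", "sobt", "aibt", "tsat"]
print(find_duplicate_columns(cols))
print(dedupe_column_names(cols))

# Hypothetical usage on the Spark DataFrame `df` from the question:
# print(find_duplicate_columns(df.columns))
# pdf = df.toDF(*dedupe_column_names(df.columns)).toPandas()
```

Renaming only works around the symptom. If the duplicate comes from an earlier join or select, as it did in my case, it is probably cleaner to drop or rename the extra column at that step.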

A similar question about "python - Error converting a table to pandas with PySpark in Databricks" can be found on Stack Overflow: https://stackoverflow.com/questions/75602965/