python - Error converting a table to pandas with PySpark in Databricks

Tags: python pandas apache-spark pyspark databricks

I am using Databricks and want to convert my PySpark DataFrame to a pandas DataFrame with df.toPandas().

However, I keep getting this error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
  'DataFrame' object has no attribute 'dtype'
  warnings.warn(msg)
AttributeError: 'DataFrame' object has no attribute 'dtype'

I have tried several things, including:

spark.conf.set("spark.sql.execution.arrow.enabled", "false")

So far nothing has worked (I also checked several other posts about this issue, but none of them helped).

Update: output of df.printSchema():

root
 |-- flight_id: string (nullable = true)
 |-- flight_direction: string (nullable = true)
 |-- service_type: string (nullable = true)
 |-- flight_designator: string (nullable = true)
 |-- flight_number: string (nullable = true)
 |-- callsign: string (nullable = true)
 |-- scheduled_datetime: timestamp (nullable = true)
 |-- connecting_flight_designator: string (nullable = true)
 |-- airport_iata_codes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- airline_name: string (nullable = true)
 |-- airport_names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country_number: long (nullable = true)
 |-- eu_category: string (nullable = true)
 |-- safe_town_indicator: boolean (nullable = true)
 |-- sibt: timestamp (nullable = true)
 |-- aibt: timestamp (nullable = true)
 |-- sobt: timestamp (nullable = true)
 |-- aibt: timestamp (nullable = true)
 |-- tsat: timestamp (nullable = true)
 |-- aircraft_name: string (nullable = true)
 |-- aircraft_registration: string (nullable = true)
 |-- ramp: string (nullable = true)
 |-- ramp_previous: string (nullable = true)
 |-- seats: long (nullable = true)
 |-- actual_total_pax: integer (nullable = true)
 |-- handler_apron: string (nullable = true)
 |-- occupancy_rate: double (nullable = false)

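Best answer

The problem was in the data filtering step, which left duplicate columns in the DataFrame (the schema above lists aibt twice). If anyone runs into a similar error in the future, check for duplicate column names.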
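As a rough illustration of that check, here is a minimal sketch that looks for repeated names in a list of column names such as df.columns. The helper names find_duplicate_columns and dedupe_column_names are hypothetical, not part of PySpark, and the sample list is only a slice of the schema from the question. A plausible reason for the exact message is that pandas returns a DataFrame rather than a Series when a column label is duplicated, and a DataFrame has no dtype attribute.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


def dedupe_column_names(columns):
    """Return new names where repeats get a numeric suffix (aibt, aibt_1, ...)."""
    seen = Counter()
    renamed = []
    for name in columns:
        # First occurrence keeps its name; later ones get _1, _2, ...
        renamed.append(name if seen[name] == 0 else f"{name}_{seen[name]}")
        seen[name] += 1
    return renamed


# Column names taken from part of the schema in the question.
columns = ["sibt", "aibt", "sobt", "aibt", "tsat"]
print(find_duplicate_columns(columns))
print(dedupe_column_names(columns))

# With a real PySpark DataFrame `df` (not created here), this would be:
#   find_duplicate_columns(df.columns)
#   df = df.toDF(*dedupe_column_names(df.columns))
```

Renaming with toDF is only one option, and the suffix scheme could clash with an existing column that already has such a name. If the duplicate comes from a join or filter step, fixing that step so the column is only selected once is probably the cleaner solution.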

This question is based on a similar one found on Stack Overflow: https://stackoverflow.com/questions/75602965/
