apache-spark - Creating a DataFrame from a Row results in 'infer schema issue'

Tags: apache-spark pyspark apache-spark-sql

When I started learning PySpark, I created a dataframe from a list. Since inferring the schema from a list is now deprecated, I got a warning advising me to use pyspark.sql.Row instead. However, when I try to create a DataFrame from a Row, I get an infer schema issue. Here is my code:

>>> from pyspark.sql import Row
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)

This results in the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>

So I created a schema:

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> schema = StructType([StructField('name', StringType()),
...                      StructField('age', IntegerType())])
>>> df = spark.createDataFrame(row, schema)

However, this error is thrown:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
    verify_func(obj, schema)
  File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>

Best Answer

The createDataFrame function expects a list of Rows (among other options), plus the schema. Passing a single Row fails because a Row behaves like a tuple: Spark iterates over its individual field values ('Severin' and 33) and tries to infer or verify a schema for each bare value, which is why the errors complain about type <type 'int'>. The correct code should look like this:

from pyspark.sql.types import *
from pyspark.sql import Row

schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)

df.printSchema()
df.show()

Output:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+-------+---+
|   name|age|
+-------+---+
|Severin| 33|
|   John| 48|
+-------+---+

You can find more details about the createDataFrame function in the PySpark documentation (link).
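As a side note, here is a minimal sketch of an alternative, assuming the same SparkSession bound to the name spark: if you wrap the single Row in a list, createDataFrame can usually infer the schema from the Row's named fields, so the explicit StructType becomes optional.

from pyspark.sql import Row

# Minimal sketch, assuming an existing SparkSession named `spark`:
# wrapping the single Row in a list lets createDataFrame infer the
# schema from the Row's named fields, so no explicit StructType is needed.
row = Row(name='Severin', age=33)
df = spark.createDataFrame([row])

df.printSchema()
df.show()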

Regarding apache-spark - Creating a DataFrame from a Row results in 'infer schema issue', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44948465/
