python - Spark SQL : TypeError ("StructType can not accept object in type %s" % type(obj))

标签 python apache-spark apache-spark-sql

我目前正在使用 PyODBC 从 SQL Server 中提取数据,并尝试以近实时 (NRT) 的方式将数据插入到 Hive 中的表中。

我从源中获取了一行并转换为 List[Strings] 并以编程方式创建模式,但是在创建 DataFrame 时,Spark 抛出 StructType 错误。

>>> cnxn = pyodbc.connect(con_string)
>>> aj = cnxn.cursor()
>>>
>>> aj.execute("select * from tjob")
<pyodbc.Cursor object at 0x257b2d0>

>>> row = aj.fetchone()

>>> row
(1127, u'', u'8196660', u'', u'', 0, u'', u'', None, 35, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, u'', 0, None, None)
>>> rowstr = map(str,row)
>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaString = " ".join([row.column_name for row in aj.columns(table='tjob')])

>>> schemaString
u'ID ExternalID Name Description Notes Type Lot SubLot ParentJobID ProductID PlannedStartDateTime PlannedDurationSeconds Capture01 Capture02 Capture03 Capture04 Capture05 Capture06 Capture07 Capture08 Capture09 Capture10 Capture11 Capture12 Capture13 Capture14 Capture15 Capture16 Capture17 Capture18 Capture19 Capture20 User UserState ModifiedDateTime UploadedDateTime'

>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)

>>> [f.dataType for f in schema.fields]
[StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType, StringType]

>>> myrdd = sc.parallelize(rowstr)

>>> myrdd.collect()
['1127', '', '8196660', '', '', '0', '', '', 'None', '35', 'None', '0', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', '', '0', 'None', 'None']

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/context.py", line 298, in _createFromRDD
    _verify_type(row, schema)
  File "/apps/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark/python/pyspark/sql/types.py", line 1132, in _verify_type
    raise TypeError("StructType can not accept object in type %s" % type(obj))
TypeError: StructType can not accept object in type <type 'str'>

最佳答案

这里是错误信息的原因:

>>> rowstr
['1127', '', '8196660', '', '', '0', '', '', 'None' ... ]   
#rowstr is a list of str

>>> myrdd = sc.parallelize(rowstr)
#myrdd is a rdd of str

>>> schema = StructType(fields)
#schema is StructType([StringType, StringType, ....])

>>> schemaPeople = sqlContext.createDataFrame(myrdd, schema)
#myrdd should have been RDD([StringType, StringType,...]) but is RDD(str)

要解决这个问题,请创建正确类型的 RDD:

>>> myrdd = sc.parallelize([rowstr])

关于python - Spark SQL : TypeError ("StructType can not accept object in type %s" % type(obj)),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36676594/

相关文章:

python - 当我运行 heroku ps :scale web=1 command i am getting this error. 任何人都可以帮助我吗

python - 如何对 pyspark 中的一组列进行分桶?

python - 通过排除使用 isin 过滤 pyspark 数据帧

java - 如何反转或求补 Java 中返回 boolean 值的函数的输出

python - 使用 posexplode 分解带有索引的嵌套 JSON

python - 字符串日期至今( Pandas )

python - 在gekko中一起建模微分方程和线性方程?

python - 使用 BeautifulSoup 获取文档 DOCTYPE

python - 是否可以从 Scala(spark) 调用 python 函数

apache-spark - 如何附加到 HDFS 中的同一文件(spark 2.11)