I am trying to load a CSV file into a Spark DataFrame. This is what I have done so far:
from pyspark import SparkConf, SparkContext
from pyspark import sql

appName = "testSpark"
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
# csv path
text_file = sc.textFile("hdfs:///path/to/sensordata20171008223515.csv")
df = sqlContext.load(source="com.databricks.spark.csv", header='true', path=text_file)
print df.schema()
This is the traceback:
Traceback (most recent call last):
File "/home/centos/main.py", line 16, in <module>
df = sc.textFile(text_file).map(lambda line: (line.split(';')[0], line.split(';')[1])).collect()
File "/usr/hdp/2.5.6.0-40/spark/python/lib/pyspark.zip/pyspark/context.py", line 474, in textFile
File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 804, in __call__
File "/usr/hdp/2.5.6.0-40/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 278, in get_command_part
AttributeError: 'RDD' object has no attribute '_get_object_id'
I am new to this, so it would be very helpful if someone could tell me what I am doing wrong.
Best Answer
You cannot pass an RDD to the CSV reader. You should use the path directly:
df = sqlContext.load(source="com.databricks.spark.csv",
                     header='true', path="hdfs:///path/to/sensordata20171008223515.csv")
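The same fix can also be written with the DataFrameReader API. A minimal sketch, assuming Spark 1.5/1.6 (as shipped with HDP 2.5) with the spark-csv package on the classpath; the path is the one from the question:

# Equivalent load via the DataFrameReader API; inferSchema is optional.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///path/to/sensordata20171008223515.csv"))
df.printSchema()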
Only a few formats (notably JSON) accept an RDD as an input argument.
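For contrast, a small sketch of the JSON case, where an RDD of JSON strings is accepted directly (sqlContext.read.json takes a path, a list, or an RDD); the sample records below are made up purely for illustration:

# JSON is one of the few sources that accepts an RDD directly.
json_rdd = sc.parallelize([
    '{"sensor": "s1", "value": 23.5}',
    '{"sensor": "s2", "value": 19.1}',
])
json_df = sqlContext.read.json(json_rdd)
json_df.printSchema()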
Regarding python - PySpark load CSV AttributeError: 'RDD' object has no attribute '_get_object_id', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45633302/