apache-spark - PySpark: create a DataFrame from an RDD of key/value pairs

If I have an RDD of key/value pairs (where the key is the column index), is it possible to load it into a DataFrame?
For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

and have the DataFrame look like:

1,2,18
1,10,18
2,20,18

Best answer

Yes, it is possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
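
The call above yields a two-column DataFrame (id, score), not the wide layout shown in the question. As a rough sketch of the reshaping step only, the pairs can be grouped by key and zipped into rows in plain Python. This assumes the data is small enough to collect to the driver and that the order of values within each key is meaningful; `pivot_pairs` is a hypothetical helper name, not part of Spark.

```python
def pivot_pairs(pairs):
    """Group (column_index, value) pairs by key, then zip the groups into rows.

    Hypothetical helper for illustration; not a Spark API.
    """
    columns = {}
    for key, value in pairs:
        # Collect values per column index, preserving input order.
        columns.setdefault(key, []).append(value)
    keys = sorted(columns)
    # zip stops at the shortest column, so ragged columns are truncated.
    rows = list(zip(*(columns[k] for k in keys)))
    return keys, rows


pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)]
keys, rows = pivot_pairs(pairs)
print(keys)  # [0, 1, 3]
print(rows)  # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]
```

In a PySpark session, `pairs` could plausibly come from `rdd.collect()`, and the resulting `rows` could be passed to `sqlContext.createDataFrame(rows, ["c0", "c1", "c3"])`, where the column names are placeholders chosen here. Note that an RDD does not in general guarantee the order of values per key, so the row alignment is only as reliable as the input ordering.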

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/30007200/
