python-3.x - Bucketing a Spark DataFrame - pyspark

Tags: python-3.x apache-spark hadoop pyspark bigdata

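I have a Spark DataFrame with a column (age). I need to write a pyspark script that buckets the DataFrame into 10-year age ranges (for example 11-20, 21-30, and so on) and finds the number of entries in each range.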

For example:

I have the following DataFrame:

+-----+
|age  |  
+-----+
|   21|      
|   23|     
|   35|     
|   39|    
+-----+

After bucketing (expected):

+-----+------+
|age  | count|
+-----+------+
|21-30|    2 |    
|31-40|    2 |      
+-----+------+

Best answer

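A simple way to do this is to compute a histogram on the underlying RDD.

Given known age range boundaries (fortunately these are easy to put together; here 1, 11, 21, and so on), producing the histogram is fairly easy: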

hist = df.rdd\
  .map(lambda l: l['age'])\
  .histogram([1, 11, 21, 31, 41, 51, 61, 71, 81, 91])

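This returns a tuple holding the bucket boundaries ("age ranges") and the corresponding observation counts. The ten boundaries define nine buckets, so the second list has nine entries: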

([1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
  [10, 10, 10, 10, 10, 10, 10, 10, 11])
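To make the bucket semantics concrete, here is a small pure-Python sketch that mimics how `histogram` assigns values when given explicit boundaries: every bucket is half-open (`[21, 31)` holds 21 through 30), except the last one, which also includes its upper boundary. The helper name `local_histogram` is hypothetical and is not part of Spark; it runs on a plain list rather than an RDD and is only meant to illustrate the rule, using the four ages from the question.

```python
from bisect import bisect_right


def local_histogram(values, boundaries):
    """Hypothetical local stand-in for RDD.histogram with explicit boundaries."""
    counts = [0] * (len(boundaries) - 1)
    lo, hi = boundaries[0], boundaries[-1]
    for v in values:
        if v < lo or v > hi:
            continue  # values outside the overall range are not counted
        # bisect_right locates the bucket; the clamp puts v == hi in the last bucket
        idx = min(bisect_right(boundaries, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return boundaries, counts


ages = [21, 23, 35, 39]  # the sample ages from the question
boundaries = [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
hist = local_histogram(ages, boundaries)
print(hist[1])  # [0, 0, 2, 2, 0, 0, 0, 0, 0]
```

With the question's data, only the third and fourth buckets (21-30 and 31-40) are non-empty, which matches the expected output.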

You can then convert it back to a DataFrame like this:

from pyspark.sql import Row

# Pair each bucket's lower boundary with its count (zip stops at the shorter list)
countTuples = zip(hist[0], hist[1])
# Build one Row per bucket
ageList = [Row(age_range=lower, count=n) for lower, n in countTuples]
# sc is the active SparkContext
sc.parallelize(ageList).toDF()
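The conversion above labels each bucket only by its lower boundary, while the question asks for labels such as `21-30`. The following sketch shows one possible way to build those labels from the `(boundaries, counts)` tuple. It assumes integer ages, so a bucket `[21, 31)` can be written as `21-30`. The helper `label_buckets` and its `skip_empty` parameter are hypothetical names introduced for illustration.

```python
def label_buckets(boundaries, counts, skip_empty=True):
    """Turn histogram output into ("lower-upper", count) pairs (integer ages assumed)."""
    rows = []
    last = len(counts) - 1
    for i, count in enumerate(counts):
        if skip_empty and count == 0:
            continue  # drop buckets with no observations
        lower = boundaries[i]
        # Every bucket except the last excludes its upper boundary
        upper = boundaries[i + 1] if i == last else boundaries[i + 1] - 1
        rows.append((f"{lower}-{upper}", count))
    return rows


boundaries = [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
counts = [0, 0, 2, 2, 0, 0, 0, 0, 0]  # counts for the ages 21, 23, 35, 39
rows = label_buckets(boundaries, counts)
print(rows)  # [('21-30', 2), ('31-40', 2)]
```

If a SparkSession is available (assumed here to be named `spark`), the resulting list of tuples could be passed to `spark.createDataFrame(rows, ["age", "count"])` to get a DataFrame shaped like the expected output.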

For more information, see the documentation for the histogram function in the RDD API.

Regarding "python-3.x - Bucketing a Spark DataFrame - pyspark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49516581/
