python - Median and quantile values in PySpark

Tags: python apache-spark pyspark apache-spark-sql

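My dataframe has an age column, and the total row count is roughly 77 billion. I want to compute quantile values for that column with PySpark. I have some code, but it takes a very long time to run (maybe my approach is just bad).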

Is there a good way to improve this?

Sample dataframe:

id       age
1         18
2         32
3         54
4         63
5         42
6         23

What I have done so far:

#Summary stats
df.describe('age').show()

#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)

Best answer

The first improvement is to compute all the quantiles in a single call:

quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)

Also note that you are computing the quantiles exactly. From the documentation we can see (emphasis added by me):

relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
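To make the meaning of relativeError more concrete: the Spark documentation for approxQuantile states that, for N rows, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of that arithmetic in plain Python. The helper name rank_bounds and the clamping to [0, N] are illustrative assumptions and are not part of Spark.

```python
import math


def rank_bounds(n_rows, probability, relative_error):
    """Rank interval that an approximate quantile may fall into.

    Follows the bound quoted in the approxQuantile documentation:
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    Clamping to [0, n_rows] is an added assumption for readability.
    """
    low = math.floor((probability - relative_error) * n_rows)
    high = math.ceil((probability + relative_error) * n_rows)
    return max(low, 0), min(high, n_rows)


# Median of about 77 billion rows with a 1% relative error (illustrative value).
low, high = rank_bounds(77_000_000_000, 0.5, 0.01)
print(f"rank of the returned median lies between {low:,} and {high:,}")
```

With relative_error set to 0 the interval collapses to a single rank, which is the exact (and expensive) case described in the quote.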

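Since your dataframe is very large, I would expect some error in these calculations to be acceptable. It is a trade-off between speed and precision, although any value greater than 0 is likely to improve speed significantly.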
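Putting both suggestions together could look like the sketch below. It assumes df is a PySpark DataFrame with a numeric age column, as in the question. The function name age_quartiles and the default relative_error of 0.01 are illustrative choices only; the acceptable error depends on your use case.

```python
def age_quartiles(df, relative_error=0.01):
    """Return approximate quartiles of the `age` column from one call.

    `df` is assumed to be a PySpark DataFrame with a numeric `age` column.
    `relative_error=0.01` is an illustrative value, not a recommendation;
    passing 0 requests exact quantiles, which can be very expensive.
    """
    probabilities = [0.25, 0.5, 0.75]
    # For a single column name, approxQuantile returns one value per probability.
    q25, q50, q75 = df.approxQuantile("age", probabilities, relative_error)
    return {"q25": q25, "median": q50, "q75": q75}
```

It would be called as age_quartiles(df), or as age_quartiles(df, 0.001) for a tighter bound at the cost of more computation.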

This question, "python - Median and quantile values in PySpark", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/56159900/
