python - Unordered set or similar in Spark?

Tags: python algorithm apache-spark data-structures bigdata

I have data in this format:

(123456, (43, 4861))

(000456, (43, 4861))

where the first item is a point id and the second item is a pair whose first id is one cluster centroid and whose second id is another cluster centroid. That is, point 123456 is assigned to clusters 43 and 4861.

What I want to do is create data in this format:

(43, [123456, 000456])

(4861, [123456, 000456])

The idea is that each centroid has a list of the points assigned to it. That list must have a maximum length of 150.

Is there anything in Spark that could make my life easier here?


I don't care about fast access or about ordering. I have 100m points and 16k centroids.


Here is some artificial data I have been playing with:

from random import randint

# build 10 random (point_id, (centroid_1, centroid_2)) tuples
data = []
for i in xrange(0, 10):  # Python 2 xrange; use range() on Python 3
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)  # sc is an existing SparkContext
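
For what it's worth, here is a minimal sketch of the transformation described above done directly on the RDD, assuming sc and data from the snippet above (the name MAX_POINTS and the two helper functions are made up for illustration, not part of the question). flatMap emits one (centroid, point_id) pair per assignment, and aggregateByKey builds the per-centroid lists while enforcing the cap of 150:

MAX_POINTS = 150  # the cap from the question

# one (centroid_id, point_id) record per assignment
pairs = data.flatMap(lambda kv: [(kv[1][0], kv[0]), (kv[1][1], kv[0])])

def add_point(acc, point_id):
    # grow the per-centroid list only while it is under the cap
    if len(acc) < MAX_POINTS:
        acc.append(point_id)
    return acc

def merge_lists(a, b):
    # merge partial lists from different partitions, re-applying the cap
    return (a + b)[:MAX_POINTS]

# RDD of (centroid_id, [point_id, ...]) with at most MAX_POINTS points each
capped = pairs.aggregateByKey([], add_point, merge_lists)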

Best Answer

Judging from your description (though I still don't quite follow it), here is a simple approach using plain Python:

In [1]: from itertools import groupby

In [2]: from random import randint

In [3]: data = []  # create random samples as you did
   ...: for i in range(10):
   ...:     data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
   ...:

In [4]: result = []  # create an intermediate list to transform your sample
   ...: for point_id, cluster in data:
   ...:     for index, c in enumerate(cluster):
   ...:         # I made this up following your pattern
   ...:         result.append((c, [point_id, str(index * 100).zfill(3) + str(point_id)[-3:]]))
   ...: # sort the result by point_id as the key for grouping
   ...: result = sorted(result, key=lambda x: x[1][0])
   ...:

In [5]: result[:3]
Out[5]:
[(4020, [5002188, '000188']),
 (10983, [5002188, '100188']),
 (10800, [24763401, '000401'])]

In [6]: capped_result = []
   ...: # basically group by the sorted point_id and cap each list at 150
   ...: for _, g in groupby(result, key=lambda x: x[1][0]):
   ...:     grouped = list(g)[:150]
   ...:     capped_result.extend(grouped)
   ...: # the final result will look like
   ...: print(capped_result)
   ...:
[(4020, [5002188, '000188']), (10983, [5002188, '100188']), (10800, [24763401, '000401']), (12965, [24763401, '100401']), (6369, [24924435, '000435']), (429, [24924435, '100435']), (7666, [39240078, '000078']), (2526, [39240078, '100078']), (5260, [47597265, '000265']), (7056, [47597265, '100265']), (2824, [60159219, '000219']), (5730, [60159219, '100219']), (7837, [67208338, '000338']), (12475, [67208338, '100338']), (4897, [80084812, '000812']), (13038, [80084812, '100812']), (2944, [80253323, '000323']), (1922, [80253323, '100323']), (12777, [96811112, '000112']), (5463, [96811112, '100112'])]

Of course, this is not optimized at all, but it should give you a head start on how to tackle the problem. I hope this helps.
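
Since the question asks about Spark specifically, the same group-then-cap idea can also be sketched directly on the RDD (a sketch only, with made-up variable names; note that groupByKey materializes each full group before the cap is applied, so the aggregateByKey variant shown under the question is usually cheaper at 100m points):

by_centroid = (data
               .flatMap(lambda kv: [(c, kv[0]) for c in kv[1]])
               .groupByKey()
               .mapValues(lambda points: list(points)[:150]))
# by_centroid is an RDD of (centroid_id, [point_id, ...]) capped at 150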

Regarding "python - Unordered set or similar in Spark?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39379889/
