Python (PySpark) nested list with reduceByKey: using Python list append to create a nested list

I have an RDD input in the following format:

[('2002', ['cougar', 1]),
('2002', ['the', 10]),
('2002', ['network', 4]),
('2002', ['is', 1]),
('2002', ['database', 13])]

'2002' is the key. So my key-value pairs look like this:

 ('year', ['word', count])

count is an integer, and I want to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

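I have struggled a lot to get the nested list above. The main problem is building the nested list itself. For example, I have three lists a, b, and c: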

a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

a.append(b)

leaves a as

 ['cougar', 1, ['the', 10]]

and, starting again from the original a and b,

x = []
x.append(a)
x.append(b)

gives x as

  [['cougar', 1], ['the', 10]]

But if I then do

  c.append(x)

that leaves c as

  ['network', 4, [['cougar', 1], ['the', 10]]]

None of the operations above gives me the result I want.

I want to get

   [('2002', [[word1, c1], [word2, c2], [word3, c3], ...]),
    ('2003', [[w1, count1], [w2, count2], [w3, count3], ...])]

That is, the nested list should be:

  [a, b, c]

where a, b, and c are themselves two-element lists.
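The difference between the attempts above and the target shape can be sketched in plain Python. This is only an illustration using the a, b, and c from the question; the names `appended`, `nested`, and `nested_by_concat` are made up for the example.

```python
a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

# list.append adds its argument as ONE element (and returns None),
# so appending a pair to a pair nests it inside the first pair.
appended = list(a)          # work on a copy so that a stays untouched
appended.append(b)
print(appended)             # ['cougar', 1, ['the', 10]]

# Wrapping the pairs in an outer list gives the list of lists.
nested = [a, b, c]
print(nested)               # [['cougar', 1], ['the', 10], ['network', 4]]

# Concatenating one-element wrappers produces the same shape, and it
# works pairwise, which is what a two-argument reduce function needs.
nested_by_concat = [a] + [b] + [c]
print(nested_by_concat == nested)   # True
```

The last form matters for a pairwise reduction: each value is first wrapped as `[value]`, and only then are the wrappers joined with `+`.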

I hope the question is clear. Any suggestions?

Best answer

You don't need reduceByKey for this problem.

rdd = sc.parallelize([('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])])

[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]

rdd_nested = rdd.groupByKey().mapValues(list)

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
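If reduceByKey is still preferred, as the question originally asked, one possible sketch is to wrap each value in a one-element list first and then concatenate the lists pairwise. The helper names `wrap`, `merge`, and `nest_by_key` are hypothetical, and `nest_by_key` assumes it receives an existing PySpark RDD of `(year, [word, count])` pairs. The final lines only exercise the two functions locally with `functools.reduce`, so they run without a Spark cluster.

```python
from functools import reduce


def wrap(value):
    """Turn one [word, count] pair into a one-element list of pairs."""
    return [value]


def merge(left, right):
    """Concatenate two lists of pairs; associative, as reduceByKey expects."""
    return left + right


def nest_by_key(rdd):
    """Assumes rdd is a PySpark RDD of (key, [word, count]) tuples."""
    return rdd.mapValues(wrap).reduceByKey(merge)


# Local check of the same wrap/merge logic, without Spark.
pairs = [('2002', ['cougar', 1]), ('2002', ['the', 10]),
         ('2002', ['network', 4]), ('2002', ['is', 1]),
         ('2002', ['database', 13])]
local_result = reduce(merge, (wrap(value) for _, value in pairs))
print(local_result)
```

With a real RDD, the usage would be `nest_by_key(rdd).collect()`. Note that on a multi-partition RDD the order of the inner pairs should not be relied on, for either this variant or the groupByKey one.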

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/53696489/