apache-spark - Generating a co-occurrence matrix in Spark

Tags: apache-spark apache-spark-sql pyspark-sql

I have input transactions like the following:

apples,mangos,eggs
milk,oranges,eggs
milk, cereals
mango,apples

I need to generate a Spark DataFrame that holds a co-occurrence matrix like this:

         apple  mango  milk  cereals  eggs
apple        2      2     0        0     1
mango        2      2     0        0     1
milk         0      0     2        1     1
cereals      0      0     1        1     0
eggs         1      1     1        0     2

Apples and mangos are bought together twice, so matrix[apple][mango] = 2.

I am stuck on how to implement this. Any suggestions would be a great help. I am using PySpark.

Best answer

If the data looks like this:

df = spark.createDataFrame(
    ["apples,mangos,eggs", "milk,oranges,eggs", "milk,cereals", "mangos,apples"],
    "string"
).toDF("basket")

Imports:

from pyspark.sql.functions import split, explode, monotonically_increasing_id

Split and explode:

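# Give each basket an id, then emit one row per item.
# explode() without an alias names its output column "col".
long = (df
    .withColumn("id", monotonically_increasing_id())
    .select("id", explode(split("basket", ","))))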

Self-join and crosstab:

# Pair every item with every item from the same basket, then count the pairs.
long.withColumnRenamed("col", "col_").join(long, ["id"]).stat.crosstab("col_", "col").show()

# +--------+------+-------+----+------+----+-------+
# |col__col|apples|cereals|eggs|mangos|milk|oranges|
# +--------+------+-------+----+------+----+-------+
# | cereals|     0|      1|   0|     0|   1|      0|
# |    eggs|     1|      0|   2|     1|   1|      1|
# |    milk|     0|      1|   1|     0|   2|      1|
# |  mangos|     2|      0|   1|     2|   0|      0|
# |  apples|     2|      0|   1|     2|   0|      0|
# | oranges|     0|      0|   1|     0|   1|      1|
# +--------+------+-------+----+------+----+-------+
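
To sanity-check the Spark result on a small sample, the same counting can be sketched in plain Python. This is not part of the original answer, only a minimal reference under a few assumptions: the helper name `cooccurrence` is made up for illustration, items are stripped of surrounding whitespace (the Spark code above does not do this), and an item repeated inside one basket is counted once per repetition, as the self-join would count it.

```python
from collections import Counter
from itertools import product

# Same sample baskets as the answer's DataFrame.
baskets = ["apples,mangos,eggs", "milk,oranges,eggs", "milk,cereals", "mangos,apples"]


def cooccurrence(baskets):
    """Count ordered item pairs (a, b) that appear in the same basket.

    Hypothetical helper, mirroring split + explode + self-join on id.
    """
    counts = Counter()
    for basket in baskets:
        # split + explode; strip() is an added assumption, not in the Spark code
        items = [item.strip() for item in basket.split(",")]
        # self-join on the basket id: every item paired with every item
        counts.update(product(items, repeat=2))
    return counts


counts = cooccurrence(baskets)
items = sorted({a for a, _ in counts})
# Nested dict playing the role of the crosstab; missing pairs count as 0.
matrix = {a: {b: counts[(a, b)] for b in items} for a in items}

# Print a tab-separated table with one row and one column per item.
print("\t".join([""] + items))
for a in items:
    print("\t".join([a] + [str(matrix[a][b]) for b in items]))
```

The diagonal holds the number of baskets containing each item, and the matrix is symmetric, which matches the crosstab output above.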

For "apache-spark - Generating a co-occurrence matrix in Spark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48551900/
