scala - Printing a CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

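I am using Spark MLlib for a project in which I need to compute document similarity.

I first convert the documents to vectors with MLlib's tf-idf transformation, then wrap the result in a RowMatrix and call its columnSimilarities() method.

I followed the tf-idf documentation and used the DIMSUM implementation of cosine similarity.

This is the Scala code I ran in spark-shell: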

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()

val tf = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm

val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix

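Assume that my input file, test1 in the code above, is a simple file containing 5 short documents (fewer than 10 terms each), one per line.

Since I am only testing this code, I want to look at the output of mat.columnSimilarities(), which is held in the object sim. I would like to see how similar the first document vector is to the second, the third, and so on.

I looked at the Spark documentation for CoordinateMatrix, which is the type returned by the columnSimilarities method of the RowMatrix class and referenced by sim.

After reading more of the documentation, I thought I could convert the CoordinateMatrix to a RowMatrix, turn the rows of that RowMatrix into an array, and print it like this: println(sim.toRowMatrix().rows.toArray().mkString("\n")).

But that produces output I cannot make sense of.

Can anyone help? Links to any kind of resource would be a great help!

Thanks!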

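Best answer

You can try the following, which works on the entries of the CoordinateMatrix directly and needs no conversion to a RowMatrix. Each entry becomes one "row,col,similarity" string:

import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val transformedRDD = sim.entries.map { case MatrixEntry(row, col, value) => s"$row,$col,$value" }

To retrieve the elements, call the following action (append .foreach(println) to print one entry per line):

transformedRDD.collect()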
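One thing to keep in mind when reading those entries: columnSimilarities() compares the columns of the matrix with each other. In the question's matrix the rows are documents and the columns are hashed term features, so the row and col values printed above are term indices, not document numbers. If document-to-document similarity is the goal, one possible approach is to transpose the matrix first so that each document becomes a column. The following is only a sketch under some assumptions: it runs in a spark-shell session where sc is already defined, it uses three made-up documents in place of the test1 file, and the names docs, indexed and docSim are illustrative.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}

// Assumption: spark-shell session, so `sc` already exists.
// Made-up stand-in for the "test1" file: one short document per line.
val docs = sc.parallelize(Seq(
  "spark computes cosine similarity",
  "spark computes column similarity",
  "verilog code will not synthesize"
)).map(_.split(" ").toSeq)

val tf = new HashingTF().transform(docs)
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)

// Attach a document index to every row so it survives the transpose.
val indexed = new IndexedRowMatrix(
  tfidf.zipWithIndex.map { case (vec, docId) => IndexedRow(docId, vec) })

// After transposing, rows are hashed terms and columns are documents,
// so columnSimilarities() now compares documents with each other.
val docSim = indexed.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

// Only pairs with i < j are returned (upper triangle of the similarity matrix).
docSim.entries.collect().sortBy(e => (e.i, e.j)).foreach {
  case MatrixEntry(i, j, s) => println(s"doc $i ~ doc $j: $s")
}
```

Pairs of documents that share no terms do not appear in the output at all, because only non-zero similarities are stored as entries.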

Regarding scala - Printing a CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36254959/
