apache-spark - How to convert a Row type to a Vector to feed to KMeans

Tags: apache-spark pyspark k-means apache-spark-mllib pyspark-sql

When I try to feed df2 to KMeans, I get the following error:

clusters = KMeans.train(df2, 10, maxIterations=30,
                        runs=10, initializationMode="random")

The error I get:

Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a DataFrame created as follows:

df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')

df2.show()


+------------------+------------------+
|          latitude|         longitude|
+------------------+------------------+
|        60.1643075|        24.9460844|
|        60.4686748|        22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans?

Best answer

ML

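The problem is that you missed the documentation's example: KMeans needs Vector input, not Row objects. With the DataFrame-based ML API, that means a DataFrame with a Vector column of features.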

To change the structure of your current data, you can use a VectorAssembler. In your case, it could look something like this:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")

# If the columns hold strings rather than doubles, cast them first.
expr = [col(c).cast("double").alias(c)
        for c in vectorAssembler.getInputCols()]

df2 = df2.select(*expr)
# Append a "features" Vector column built from latitude and longitude.
df = vectorAssembler.transform(df2)
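Once the DataFrame has a "features" column, it can be passed to the DataFrame-based KMeans estimator in pyspark.ml.clustering. The following is a minimal sketch, not part of the original answer: the helper name fit_kmeans and its default values are hypothetical (the defaults mirror the k and iteration count from the question), and the import is deliberately placed inside the function so the definition loads even where Spark is not installed.

```python
def fit_kmeans(assembled_df, k=10, max_iter=30, seed=1):
    """Fit pyspark.ml KMeans on a DataFrame that has a 'features' Vector column.

    fit_kmeans is a hypothetical helper name used only for this sketch.
    """
    # Imported lazily so that defining this function does not require Spark.
    from pyspark.ml.clustering import KMeans

    kmeans = KMeans(featuresCol="features", k=k, maxIter=max_iter,
                    initMode="random", seed=seed)
    model = kmeans.fit(assembled_df)
    # transform() appends a "prediction" column with each row's cluster index.
    return model, model.transform(assembled_df)
```

Assuming df is the output of vectorAssembler.transform, a call such as model, clustered = fit_kmeans(df) would return the fitted model and the clustered rows, and model.clusterCenters() would list the centroids.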

In addition, you should rescale the features with the MinMaxScaler class to get better results.
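To make the rescaling concrete, here is a plain-Python sketch of the arithmetic that min-max scaling performs per column, (x - min) / (max - min), applied to the two sample rows from the question. It is an illustration only, not Spark code: the function name min_max_rescale is hypothetical, and mapping a constant column to 0.5 is an assumption chosen to match the midpoint behaviour described in Spark's MinMaxScaler documentation for the default [0, 1] range.

```python
def min_max_rescale(rows):
    """Rescale each column of `rows` to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*rows))          # transpose rows into columns
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread; map it to the midpoint 0.5.
            (x - lo) / (hi - lo) if hi > lo else 0.5
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# The two (latitude, longitude) rows shown in the question.
points = [(60.1643075, 24.9460844), (60.4686748, 22.2774728)]
scaled_points = min_max_rescale(points)
```

In Spark itself the same step would be done with pyspark.ml.feature.MinMaxScaler, fitted on the "features" column and then applied with transform, so that latitude and longitude contribute on the same scale to the distance computation.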

MLlib

To do this with MLlib, first use a map function to convert all the string values to doubles and merge them into a DenseVector.

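from pyspark.mllib.linalg import Vectors
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))  # .rdd is needed on Spark 2.x+, where DataFrame has no map()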

After that, you can train MLlib's KMeans model with the rdd variable.
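The per-row conversion used in the map step can be checked without a cluster. The sketch below assumes only that a row can be iterated to yield its field values, which holds for pyspark.sql.Row because it is a tuple subclass; the name row_to_floats is hypothetical, and a plain tuple stands in for a Row.

```python
def row_to_floats(row):
    """Convert every field of a row (any iterable of values) to float."""
    return [float(value) for value in row]


# A plain tuple of strings stands in for a pyspark.sql.Row here.
sample_row = ("60.1643075", "24.9460844")
features = row_to_floats(sample_row)
```

In the Spark job, the resulting list is what gets wrapped by Vectors.dense inside the map call, and the RDD of vectors can then be passed to KMeans.train from pyspark.mllib.clustering in place of df2.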

Source: a similar question on Stack Overflow about converting a Row type to a Vector to feed to KMeans: https://stackoverflow.com/questions/36142973/
