I have a dataset that contains string columns. How can I encode the string-based columns the way scikit-learn's LabelEncoder does?
Best answer
StringIndexer is exactly what you need:
https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer
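from pyspark.ml.feature import StringIndexer

# sqlContext is the SQLContext that the Spark 1.x pyspark shell predefines
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# fit() learns the label-to-index mapping; transform() appends the index column
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()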
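One difference from LabelEncoder is worth knowing: StringIndexer orders labels by frequency, so the most frequent label gets index 0, while LabelEncoder numbers the sorted distinct labels. The following is a minimal plain-Python sketch of the two orderings applied to the category column above. It does not use Spark or scikit-learn; `frequency_index` and `alphabetical_index` are hypothetical helper names made up for illustration, and breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Descending count; ties broken alphabetically (assumption, for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def alphabetical_index(values):
    """Map each distinct label to an int index in sorted label order."""
    return {label: i for i, label in enumerate(sorted(set(values)))}


# Same category values as the DataFrame in the answer.
categories = ["a", "b", "c", "a", "a", "c"]

freq_map = frequency_index(categories)
alpha_map = alphabetical_index(categories)

print(freq_map)   # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(alpha_map)  # {'a': 0, 'b': 1, 'c': 2}
```

So "b" and "c" swap indices between the two schemes for this data. If downstream code depends on specific index values, it is safer to check the fitted mapping than to assume it matches LabelEncoder.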
Regarding apache-spark - how to do LabelEncoding or categorical values in Apache Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30580410/