python-3.x - Getting precision, recall, and F1 score from a confusion matrix

Tags: python-3.x dataframe pyspark pyspark-sql

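I have a DataFrame df and have run a decision tree classification algorithm on it. The two columns used when running the algorithm are label and features. The model is called dtc. How can I create a confusion matrix in PySpark?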

dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics

preds = df.select(['label', 'features']) \
                            .df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)

# Confusion Matrix
print(metrics.confusionMatrix().toArray())

Best answer

Before calling metrics.confusionMatrix().toArray(), you need to convert the DataFrame to an RDD and map each row to a tuple.
From the official documentation:

class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)

Evaluator for multiclass classification.

Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.
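
To make the (prediction, label) pair format concrete, here is a minimal plain-Python sketch of how such pairs can be tallied into a confusion matrix. It is an illustration only, not Spark's implementation; the helper name confusion_matrix and the sample pairs are hypothetical. Rows are actual labels and columns are predicted labels.

```python
def confusion_matrix(pairs):
    """Tally (prediction, label) pairs into a square matrix.

    Rows are actual labels, columns are predicted labels,
    both ordered by class value ascending.
    """
    # Collect every class that appears as a prediction or as a label.
    classes = sorted({value for pair in pairs for value in pair})
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for prediction, label in pairs:
        matrix[index[label]][index[prediction]] += 1
    return classes, matrix


# Hypothetical sample: each tuple is (prediction, label), both as floats.
pairs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0), (2.0, 1.0)]
classes, matrix = confusion_matrix(pairs)
print(classes)
for row in matrix:
    print(row)
```

Note the order within each tuple: the prediction comes first and the label second, which matches what MulticlassMetrics expects.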


Here is an example to guide you.
Machine learning part
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.

#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
        (1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]

cols = ('a','b','c','d')

df = spark.createDataFrame(data, cols)

assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')

df_features = assembler.transform(df)

#df.show()

train_data, test_data = df_features.randomSplit([0.6,0.4])

dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')

dtcModel = dtc.fit(train_data)

predictions = dtcModel.transform(test_data)
Evaluation part
#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = predictions.select(['prediction','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')

#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])

metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

print(metrics.confusionMatrix().toArray())
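
The title asks for precision, recall, and F1, which can be derived from the confusion matrix itself. The following is a minimal plain-Python sketch, assuming the matrix has actual labels in rows and predicted labels in columns (the layout MulticlassMetrics.confusionMatrix() documents). The function name per_class_scores and the example matrix are hypothetical; with the code above you could pass metrics.confusionMatrix().toArray().tolist().

```python
def per_class_scores(matrix):
    """Derive (precision, recall, F1) per class from a square confusion matrix.

    Assumes rows are actual labels and columns are predicted labels.
    Returns a dict keyed by class index (position in the matrix).
    """
    n = len(matrix)
    scores = {}
    for k in range(n):
        true_positives = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        # Guard against classes that were never predicted or never present.
        precision = true_positives / predicted_k if predicted_k else 0.0
        recall = true_positives / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = (precision, recall, f1)
    return scores


# Hypothetical 3-class confusion matrix, not output from the model above.
example = [[2.0, 0.0, 0.0],
           [1.0, 1.0, 0.0],
           [0.0, 1.0, 2.0]]
for k, (precision, recall, f1) in per_class_scores(example).items():
    print(k, round(precision, 3), round(recall, 3), round(f1, 3))
```

MulticlassMetrics also exposes precision(label), recall(label), and fMeasure(label), which take the class label as a float (for example metrics.precision(1.0)), so the manual computation is mainly useful for checking or for working from a matrix you already have.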

Regarding "python-3.x - Getting precision, recall, and F1 score from a confusion matrix", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58404845/
