numpy - Spark random forest - Could not convert float to int error

Tags: numpy machine-learning pyspark random-forest apache-spark-ml

I have numeric features and a binary response. I am trying to build ensemble tree models such as random forests and gradient-boosted trees, but I keep running into an error. I have reproduced it with the iris data; the error is shown below, and the full error message is at the bottom.

TypeError: Could not convert 12.631578947368421 to int

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
y = list(iris.target)
df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df = df.drop(['Species'], axis = 1)
df['label'] = y
# `spark` is the active SparkSession (created automatically in the pyspark shell / notebooks)
spark_df = spark.createDataFrame(df).drop('Id')
cols = spark_df.drop('label').columns
assembler = VectorAssembler(inputCols = cols, outputCol = 'features')
output_dat = assembler.transform(spark_df).select('label', 'features')

rf = RandomForestClassifier(labelCol = "label", featuresCol = "features")
paramGrid_rf = ParamGridBuilder() \
                     .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
                     .addGrid(rf.numTrees, np.linspace(10, 60, 20)).build()

crossval_rf = CrossValidator(estimator = rf,
                         estimatorParamMaps = paramGrid_rf,
                         evaluator = BinaryClassificationEvaluator(),
                         numFolds = 5) 

cvModel_rf = crossval_rf.fit(output_dat)

TypeError                                 Traceback (most recent call last)
<ipython-input-24-44f8f759ed8e> in <module>
      2 paramGrid_rf = ParamGridBuilder() \
      3    .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
----> 4    .addGrid(rf.numTrees, np.linspace(10, 60, 20)) \
      5    .build()
      6 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in build(self)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in to_key_value_pairs(keys, values)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in toInt(value)
    197             return int(value)
    198         else:
--> 199             raise TypeError("Could not convert %s to int" % value)
    200 
    201     @staticmethod

TypeError: Could not convert 12.631578947368421 to int

Best Answer

Both maxDepth and numTrees need to be integers; Numpy linspace produces floats:

import numpy as np
np.linspace(10, 60, 20)

Result:

array([ 10.        ,  12.63157895,  15.26315789,  17.89473684,
        20.52631579,  23.15789474,  25.78947368,  28.42105263,
        31.05263158,  33.68421053,  36.31578947,  38.94736842,
        41.57894737,  44.21052632,  46.84210526,  49.47368421,
        52.10526316,  54.73684211,  57.36842105,  60.        ])

So your code hits the first non-integer value (here 12.63157895) and the error is raised.
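You can reproduce the conversion failure in isolation with the same converter the traceback points at (a minimal check; TypeConverters lives in pyspark.ml.param):

from pyspark.ml.param import TypeConverters

# the converter applied to integer params such as numTrees and maxDepth
TypeConverters.toInt(12.631578947368421)
# TypeError: Could not convert 12.631578947368421 to int

Note that the offending 12.63… comes from the numTrees grid; np.linspace(5, 30, 6) happens to yield only whole numbers (5., 10., …, 30.), which the converter accepts, which is why the traceback arrow points at the numTrees line.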

Use arange instead:

np.arange(10, 60, 20)
# array([10, 30, 50])
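
Applied to the grid in the question, a corrected version could look like the sketch below (the step sizes are only illustrative; any iterable of plain Python ints works, and casting explicitly avoids depending on how numpy scalar types are handled):

# cast to plain Python ints so every grid value is a genuine integer
maxDepth_grid = [int(x) for x in np.arange(5, 31, 5)]     # [5, 10, 15, 20, 25, 30]
numTrees_grid = [int(x) for x in np.arange(10, 61, 10)]   # [10, 20, 30, 40, 50, 60]

paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, maxDepth_grid) \
    .addGrid(rf.numTrees, numTrees_grid) \
    .build()

With these lists the CrossValidator evaluates 6 × 6 = 36 parameter combinations over the 5 folds.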

Regarding this "numpy - Spark random forest - could not convert float to int" error, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55283267/
