python - PySpark logistic regression: how to get the coefficient of each feature

Tags: python apache-spark pyspark apache-spark-mllib

I'm new to Spark, and my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example in the Spark Python MLlib documentation:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
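# Index the tuple instead of unpacking it in the lambda signature (unpacking there is Python 2-only syntax)
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())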
print("Training Error = " + str(trainErr))

I found that the model has these attributes:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold  

How do I get the coefficients of the logistic regression?

Best answer

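As you noticed, the coefficients are exposed as attributes of LogisticRegressionModel: weights holds one coefficient per feature, and intercept holds the bias term.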

Parameters:

weights – Weights computed for every feature.

intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)

numFeatures – the dimension of the features.

numClasses – the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.
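For a binary model, the entries of `model.weights` follow the order of the feature columns used to build each `LabeledPoint`, so they can be paired with column names. A minimal sketch, assuming hypothetical feature names and values, and using plain Python lists as stand-ins for `model.weights.toArray()` and `model.intercept` so that it runs without a Spark context:

```python
# Hypothetical stand-ins: in a real session these would come from the trained
# model, e.g. weights = model.weights.toArray() and intercept = model.intercept.
weights = [0.8, -1.5, 0.3]
intercept = 0.25

# Hypothetical feature names, in the same order as the columns in LabeledPoint.
feature_names = ["age", "income", "visits"]


def describe_coefficients(names, weights, intercept):
    """Return (name, coefficient) pairs, with the intercept listed first."""
    if len(names) != len(weights):
        raise ValueError("expected exactly one name per weight")
    return [("(intercept)", intercept)] + list(zip(names, weights))


coefficients = describe_coefficients(feature_names, weights, intercept)
for name, value in coefficients:
    print("%-12s % .4f" % (name, value))
```

The length check is there because a mismatch between names and weights usually means the feature vector was built in a different order or with a different number of columns than expected.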

Keep in mind that hθ(x) = 1 / (1 + exp(-(θ0 + θ1 * x1 + ... + θn * xn))), where θ0 is the intercept, [θ1, ..., θn] are the weights, and n is the number of features.
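The logistic function 1 / (1 + exp(-margin)), with margin = θ0 + θ1 * x1 + ... + θn * xn, can be evaluated by hand once the weights and intercept are known. The sketch below is an illustration in plain Python, not the library's code; the function name `predict_probability` and the numeric values are hypothetical:

```python
import math


def predict_probability(weights, intercept, features):
    """P(label = 1 | features) for a binary logistic regression model."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    # Branch on the sign so exp() is never called with a large positive
    # argument, which would overflow.
    if margin > 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


# Hypothetical coefficients and a hypothetical data point.
weights = [0.5, -0.25]
intercept = 0.1
print(predict_probability(weights, intercept, [2.0, 4.0]))
```

A positive weight raises the predicted probability as its feature grows, and a negative weight lowers it, which is how the coefficients are usually read.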

Edit

This is how the prediction is computed; you can check it in the source of LogisticRegressionModel:

def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
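In the multinomial branch, `predict` reads row i of `_weightsMatrix` as the coefficients for class i + 1, with the intercept stored as the last entry of the row when one was fitted; class 0 is the reference class with an implicit margin of 0. A sketch of splitting a flat weight vector along those lines follows. The row-major layout is an assumption inferred from the indexing in `predict`, and the function name and values are hypothetical:

```python
def split_multinomial_weights(flat_weights, num_classes, num_features):
    """Split a flat weight list into {class: (coefficients, intercept)}.

    Assumes one row per non-reference class (classes 1 .. num_classes - 1),
    stored row by row, with the intercept as the last entry of each row
    when the model was trained with one.
    """
    if num_classes < 3:
        raise ValueError("the multinomial layout needs at least 3 classes")
    rows = num_classes - 1
    if len(flat_weights) % rows != 0:
        raise ValueError("weight count is not a multiple of num_classes - 1")
    row_size = len(flat_weights) // rows
    if row_size not in (num_features, num_features + 1):
        raise ValueError("row size does not match num_features")

    has_intercept = row_size == num_features + 1
    result = {}
    for i in range(rows):
        row = list(flat_weights[i * row_size:(i + 1) * row_size])
        intercept = row[num_features] if has_intercept else 0.0
        result[i + 1] = (row[:num_features], intercept)
    return result


# Hypothetical 3-class, 2-feature model trained with an intercept.
flat = [0.1, 0.2, 0.5, -0.3, 0.4, -0.1]
per_class = split_multinomial_weights(flat, num_classes=3, num_features=2)
for label, (coefs, intercept) in sorted(per_class.items()):
    print(label, coefs, intercept)
```

With a real model, `flat` would be replaced by `model.weights.toArray()`; checking the result against `model.predict` on a few points is a reasonable way to confirm the layout assumption.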

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/36995214/
