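python - Median-based linear regression in Python

I would like to perform a one-dimensional linear regression by minimizing the median absolute error.

I initially assumed this would be a fairly standard use case, but a quick search surprisingly showed that all the regression and interpolation functions use the mean squared error.

So my question is: is there a function that can perform median-error-based linear regression in one dimension?

Best answer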

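As already pointed out in the comments, even though what you are asking for is itself well-defined, the right way to solve it will depend on the properties of your model. Let's see why, let's see how far a general-purpose optimization approach gets you, and let's see how a bit of math can simplify the problem. A copy-pastable solution is included at the bottom.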

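First of all, least squares fitting is "easier" than what you are trying to do, in the sense that specialized algorithms apply. For example, SciPy's leastsq uses the Levenberg-Marquardt algorithm, which assumes that your optimization objective is a sum of squares. And of course, in the special case of linear regression, the problem can also be solved analytically.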
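As a minimal sketch of what "solved analytically" means for a straight-line fit, the closed-form least-squares estimates can be written directly in terms of sample means. The helper name `ols_line` and the demo data are assumptions made for illustration; `np.polyfit` is used only as a cross-check.

```python
import numpy as np

def ols_line(xs, ys):
    """Closed-form least-squares fit of ys ~ theta0 + theta1*xs."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x_mean, y_mean = xs.mean(), ys.mean()
    # Slope: covariance of x and y divided by the variance of x
    theta1 = np.sum((xs - x_mean) * (ys - y_mean)) / np.sum((xs - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Hypothetical demo data, similar in spirit to the data used in this answer
rng = np.random.default_rng(0)
xs_demo = np.arange(50)
ys_demo = 2 * xs_demo + 3 + rng.normal(0, 1, 50)

theta0, theta1 = ols_line(xs_demo, ys_demo)
# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_ref, intercept_ref = np.polyfit(xs_demo, ys_demo, 1)
print(theta0, theta1)
```

No iteration and no initial guess are involved, which is the practical advantage that the median-based objective does not share.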

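Besides the practical advantage, least squares linear regression can also be justified theoretically: if the residuals of your observations are independent and normally distributed (which you may be able to justify if the central limit theorem applies in your model), then the maximum likelihood estimate of your model parameters is the one obtained through least squares. Similarly, the parameters minimizing the mean absolute error are the maximum likelihood estimates for Laplace distributed residuals. Now, what you are trying to do will have an advantage over ordinary least squares if you know in advance that your data is so dirty that assumptions such as normality of the residuals will fail. Even then, you may be able to justify other assumptions that affect the choice of objective function, so I am curious how you ended up in this situation?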
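To make the link between the objective and the estimator concrete, here is a small sketch for the simplest possible model, a single constant. It checks numerically that the squared-error objective is minimized near the sample mean, while the absolute-error objective is minimized near the sample median and is therefore much less affected by an outlier. The data, the grid, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 well-behaved values plus one gross outlier (101 values in total)
sample = np.append(rng.normal(0, 1, 100), 50.0)

# Evaluate both objectives for a grid of candidate constants c
candidates = np.linspace(-5, 5, 2001)
mse = [np.mean((sample - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(sample - c)) for c in candidates]

best_mse = candidates[np.argmin(mse)]  # close to the mean, dragged by the outlier
best_mae = candidates[np.argmin(mae)]  # close to the median, barely affected

print(best_mse, sample.mean())
print(best_mae, np.median(sample))
```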

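Using numerical methods

With that out of the way, some general remarks apply. First of all, note that SciPy comes with a large selection of general purpose algorithms that you can apply directly to your case. As an example, let's see how to apply minimize in the univariate case.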

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Generate some data
np.random.seed(0)
n = 200
xs = np.arange(n)
ys = 2*xs + 3 + np.random.normal(0, 30, n)

# Define the optimization objective: the median absolute residual of the
# line theta[0] + theta[1]*x
def f(theta):
    return np.median(np.abs(theta[1]*xs + theta[0] - ys))

# Provide a poor, but not terrible, initial guess to challenge SciPy a bit
initial_theta = [10, 5]
res = minimize(f, initial_theta)

# Plot the results
plt.scatter(xs, ys, s=1)
plt.plot(xs, res.x[1]*xs + res.x[0])

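So that certainly could be worse. As pointed out by @sascha in the comments, the non-smoothness of the objective quickly becomes a problem but, again depending on what exactly your model looks like, you may find yourself looking at something that is convex enough to save you.

If your parameter space is low-dimensional, simply plotting the optimization landscape can give you an intuition for how robust the optimization is.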

theta0s = np.linspace(-100, 100, 200)
theta1s = np.linspace(-5, 5, 200)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]
plt.contour(theta0s, theta1s, costs, 50)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

In the particular example above, the general purpose optimization algorithms fail if your initial guess is far off.

initial_theta = [10, 10000]
res = minimize(f, initial_theta)
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

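Note also that many of SciPy's algorithms benefit from being given the Jacobian of the objective. Even though your objective is not differentiable, your residuals may very well be (again depending on what you are trying to optimize). As a result, your objective may be differentiable almost everywhere, and you can provide the derivatives: for example, the derivative of a median becomes the derivative of the function whose value is the median.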
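The parenthetical remark about the derivative of a median can be checked numerically. The sketch below, on hypothetical data with an odd number of points, picks the data point whose absolute residual is the median, differentiates that single absolute residual, and compares the result with a central finite difference. The names `objective` and `gradient` are assumptions made for this illustration, and the comparison is only meaningful away from the kinks of the objective.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 11  # odd, so the median is an actual sample value
xs_demo = np.arange(n, dtype=float)
ys_demo = 2 * xs_demo + 3 + rng.normal(0, 5, n)

def objective(theta):
    return np.median(np.abs(theta[1] * xs_demo + theta[0] - ys_demo))

def gradient(theta):
    residuals = theta[1] * xs_demo + theta[0] - ys_demo
    # Index of the point whose absolute residual is the median one
    k = np.argsort(np.abs(residuals))[n // 2]
    sign = np.sign(residuals[k])
    # d|r_k|/d(theta0) = sign(r_k), d|r_k|/d(theta1) = sign(r_k) * x_k
    return np.array([sign, sign * xs_demo[k]])

theta = np.array([1.0, 1.5])
eps = 1e-7
# Central finite differences along each coordinate direction
numeric = np.array([
    (objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
analytic = gradient(theta)
print(numeric, analytic)
```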

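In our case, providing the Jacobian does not seem to help much, as the following example illustrates. Here, we have increased the variance of the residuals just enough to make the whole thing fall apart.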

np.random.seed(0)
n = 201
xs = np.arange(n)
ys = 2*xs + 3 + np.random.normal(0, 50, n)
initial_theta = [10, 5]
res = minimize(f, initial_theta)
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

def fder(theta):
    """Calculates the gradient of f."""
    residuals = theta[1]*xs + theta[0] - ys
    absresiduals = np.abs(residuals)
    # Note that np.median potentially interpolates, in which case the np.where below
    # would be empty. Luckily, we chose n to be odd.
    argmedian = np.where(absresiduals == np.median(absresiduals))[0][0]
    residual = residuals[argmedian]
    sign = np.sign(residual)
    return np.array([sign, sign * xs[argmedian]])

res = minimize(f, initial_theta, jac=fder)
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

In this example, we find ourselves stuck among the singularities.

theta = res.x
delta = 0.01
theta0s = np.linspace(theta[0]-delta, theta[0]+delta, 200)
theta1s = np.linspace(theta[1]-delta, theta[1]+delta, 200)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]

plt.contour(theta0s, theta1s, costs, 100)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

Moreover, this is the mess you will find around the minimum:

theta0s = np.linspace(-20, 30, 300)
theta1s = np.linspace(1, 3, 300)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]

plt.contour(theta0s, theta1s, costs, 50)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

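If you find yourself here, a different approach is probably needed. Options that still rely on general purpose optimization methods include, as @sascha mentions, replacing the objective with something simpler. Another simple option is to run the optimization from a variety of different initial inputs: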

min_f = float('inf')
for _ in range(100):
    initial_theta = np.random.uniform(-10, 10, 2)
    res = minimize(f, initial_theta, jac=fder)
    if res.fun < min_f:
        min_f = res.fun
        theta = res.x
plt.scatter(xs, ys, s=1)
plt.plot(theta[1]*xs + theta[0])
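One way to read "replacing the objective with something simpler" is to solve an easier problem first and use its solution as a starting point. The sketch below is an assumption about how that could look, not the method suggested in the comments: it fits ordinary least squares with `np.polyfit`, then refines the median absolute error with the derivative-free Nelder-Mead method of `minimize`. The data and the name `median_abs_error` are hypothetical, and the result is still only a local minimum.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 201
xs_demo = np.arange(n, dtype=float)
ys_demo = 2 * xs_demo + 3 + rng.normal(0, 50, n)

def median_abs_error(theta):
    return np.median(np.abs(theta[1] * xs_demo + theta[0] - ys_demo))

# Simpler surrogate: ordinary least squares, which has a closed-form solution.
# np.polyfit returns [slope, intercept] for a degree-1 fit.
slope, intercept = np.polyfit(xs_demo, ys_demo, 1)
start = np.array([intercept, slope])

# Derivative-free refinement of the actual, non-smooth objective
res = minimize(median_abs_error, start, method="Nelder-Mead")
print(median_abs_error(start), res.fun)
```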

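Partially analytical approach

Note that the value of theta minimizing f also minimizes the median of the squared residuals. Searching for "least median of squares" may well give you more relevant sources on this particular problem.

Here, we follow Rousseeuw, Least Median of Squares Regression, whose second section includes an algorithm that reduces the two-dimensional optimization problem above to a one-dimensional one that may be easier to solve. As above, assume that we have an odd number of data points, so that we do not have to worry about ambiguities in the definition of the median.

The first thing to notice is that if you only have a single variable (which, upon second reading of your question, might in fact be the case you are interested in), then it is easy to show that the following function gives the minimum analytically.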
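The equivalence between the two objectives follows from the fact that squaring is monotone on non-negative numbers, so it does not change which residual is the median one. A quick numerical check on hypothetical residuals (odd count, so the median is a sample value):

```python
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(0, 3, 101)  # odd count, so the median is a sample value

med_abs = np.median(np.abs(residuals))
med_sq = np.median(residuals ** 2)

# The median of the squares equals the square of the median absolute value,
# so any theta minimizing one objective also minimizes the other.
print(med_sq, med_abs ** 2)
```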

def least_median_abs_1d(x: np.ndarray):
    X = np.sort(x)  # For performance, precompute this one.
    h = len(X)//2
    diffs = X[h:] - X[:h+1]
    min_i = np.argmin(diffs)
    return diffs[min_i]/2 + X[min_i]
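The idea behind this function is that, for 2h+1 points, the median absolute deviation from a center c is the smallest radius around c that covers h+1 points, so the best c is the midpoint of the shortest window containing h+1 consecutive sorted points. The following sketch repeats the function so that it runs on its own and compares it against a brute-force grid search; the sample and the helper `median_abs_dev` are assumptions made for illustration.

```python
import numpy as np

def least_median_abs_1d(x):
    X = np.sort(x)
    h = len(X) // 2
    # Widths of all windows containing h+1 consecutive sorted points
    diffs = X[h:] - X[:h + 1]
    min_i = np.argmin(diffs)
    # Midpoint of the shortest such window
    return diffs[min_i] / 2 + X[min_i]

def median_abs_dev(x, c):
    return np.median(np.abs(x - c))

rng = np.random.default_rng(3)
sample = rng.normal(0, 1, 21)  # odd number of points, as assumed in the text

c_star = least_median_abs_1d(sample)
best = median_abs_dev(sample, c_star)

# Brute force: no grid point should beat the analytic minimizer
grid = np.linspace(sample.min(), sample.max(), 5001)
grid_best = min(median_abs_dev(sample, c) for c in grid)
print(best, grid_best)
```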

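Now, the trick is that for a fixed theta1, the value of theta0 that minimizes f(theta0, theta1) is obtained by applying the function above to ys - theta1*xs. In other words, we have reduced the problem to minimizing a function of a single variable, called g below.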

def best_theta0(theta1):
    # Here we use the data points defined above
    rs = ys - theta1*xs
    return least_median_abs_1d(rs)

def g(theta1):
    return f([best_theta0(theta1), theta1])

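While this is likely to be much easier to attack than the two-dimensional optimization problem above, we are not completely out of the woods yet, because this new function has local minima of its own: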

theta1s = np.linspace(0, 3, 500)
plt.plot(theta1s, [g(theta1) for theta1 in theta1s])

[Figure: one-dimensional reduction of the optimization problem]

theta1s = np.linspace(1.5, 2.5, 500)
plt.plot(theta1s, [g(theta1) for theta1 in theta1s])

In my limited testing, basinhopping seemed to find the minimum consistently.

from scipy.optimize import basinhopping
res = basinhopping(g, -10)
print(res.x)  # prints [ 1.72529806]

At this point, we can wrap everything up and check that the result looks reasonable:

def least_median(xs, ys, guess_theta1):
    def least_median_abs_1d(x: np.ndarray):
        X = np.sort(x)
        h = len(X)//2
        diffs = X[h:] - X[:h+1]
        min_i = np.argmin(diffs)
        return diffs[min_i]/2 + X[min_i]

    def best_median(theta1):
        rs = ys - theta1*xs
        theta0 = least_median_abs_1d(rs)
        return np.median(np.abs(rs - theta0))

    res = basinhopping(best_median, guess_theta1)
    theta1 = res.x[0]
    theta0 = least_median_abs_1d(ys - theta1*xs)
    return np.array([theta0, theta1]), res.fun

theta, med = least_median(xs, ys, 10)
# Use different colors for the sets of points within and outside the median error
active = ((ys < theta[1]*xs + theta[0] + med) & (ys > theta[1]*xs + theta[0] - med))
not_active = np.logical_not(active)
plt.plot(xs[not_active], ys[not_active], 'g.')
plt.plot(xs[active], ys[active], 'r.')
plt.plot(xs, theta[1]*xs + theta[0], 'b')
plt.plot(xs, theta[1]*xs + theta[0] + med, 'b--')
plt.plot(xs, theta[1]*xs + theta[0] - med, 'b--')

For more on median-based linear regression in Python, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/48013201/
