python - Speed of SVM kernels? Linear vs. RBF vs. Poly

I'm using scikit-learn in Python to create some SVM models while trying different kernels. The code is very simple and follows this form:

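from time import time

from sklearn import svm

# Pick one kernel per run; assigning all three in a row keeps only the last one.
clf = svm.SVC(kernel='rbf', C=1, gamma=0.1)
# clf = svm.SVC(kernel='linear', C=1, gamma=0.1)  # gamma is ignored by the linear kernel
# clf = svm.SVC(kernel='poly', C=1, gamma=0.1)

t0 = time()
clf.fit(X_train, y_train)
print("Training time:", round(time() - t0, 3), "s")
pred = clf.predict(X_test)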

The data has 8 features and a little over 3000 observations. I was surprised to see that rbf fit in under a second, whereas linear took 90 seconds and poly took hours.

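I assumed that the non-linear kernels would be more complicated and take more time. Is there a reason linear takes so much longer than rbf, and poly takes so much longer than both? Can it vary dramatically based on my data?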

Best answer

Did you scale your data?

Unscaled data can become a problem for SVMs. According to A Practical Guide to Support Vector Classification:

Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems.
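
To make the quoted point concrete, here is a minimal sketch of how the polynomial kernel value grows with the magnitude of the features. It assumes the kernel form used by scikit-learn, (gamma * <x, z> + coef0) ** degree, with the library defaults coef0=0 and degree=3 and the gamma=0.1 from the question. The two sample vectors and the divide-by-1000 rescaling are made up purely for illustration.

```python
import numpy as np


def poly_kernel(x, z, gamma=0.1, coef0=0.0, degree=3):
    """Polynomial kernel value: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * np.dot(x, z) + coef0) ** degree


# Two hypothetical samples whose features take large raw values.
x_raw = np.array([1000.0, 20.0, 300.0])
z_raw = np.array([900.0, 25.0, 310.0])

# The same samples after an illustrative rescaling to roughly [0, 1].
x_scaled = x_raw / 1000.0
z_scaled = z_raw / 1000.0

k_raw = poly_kernel(x_raw, z_raw)
k_scaled = poly_kernel(x_scaled, z_scaled)

print("Kernel value on raw features:    {0:.3e}".format(k_raw))
print("Kernel value on scaled features: {0:.3e}".format(k_scaled))
```

The raw inner product is already close to a million, and cubing it pushes the kernel value into a range where the solver has to work with very badly conditioned numbers. After rescaling, the same pair of samples yields a value well below 1.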

Now for an example, I will use the sklearn breast cancer dataset:

from time import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf_lin = SVC(kernel='linear', C=1.0, gamma=0.1)  # gamma has no effect on the linear kernel
clf_rbf = SVC(kernel='rbf', C=1.0, gamma=0.1)

start = time()
clf_lin.fit(X_train, y_train)
print("Linear Kernel Non-Normalized Fit Time: {0:.4f} s".format(time() - start))
start = time()
clf_rbf.fit(X_train, y_train)
print("RBF Kernel Non-Normalized Fit Time: {0:.4f} s".format(time() - start))

scaler = MinMaxScaler()  # Default behavior is to scale to [0,1]
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)

start = time()
clf_lin.fit(X_train, y_train)
print("Linear Kernel Normalized Fit Time: {0:.4f} s".format(time() - start))
start = time()
clf_rbf.fit(X_train, y_train)
print("RBF Kernel Normalized Fit Time: {0:.4f} s".format(time() - start))

Output:

Linear Kernel Non-Normalized Fit Time: 0.8672
RBF Kernel Non-Normalized Fit Time: 0.0124
Linear Kernel Normalized Fit Time: 0.0021
RBF Kernel Normalized Fit Time: 0.0039

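So you can see that on this dataset, which has shape (569, 30), a little scaling gives a substantial speedup: the linear kernel fits roughly 400 times faster and the RBF kernel roughly 3 times faster.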

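This behavior depends on the features with large values. Think about working in an infinite-dimensional space. As the values you populate that space with get larger, the distance between their multidimensional products gets a lot bigger. I cannot stress "a lot" enough. Read about The Curse of Dimensionality, and read more than just the wiki entry I linked. This spacing is what makes the fit take longer. The mathematics of separating the classes in such a massive space becomes drastically more complex, especially as the number of features and observations grows. It is therefore critical to always scale your data. Even if you are only doing a simple linear regression it is good practice, because it removes any possible bias toward features with larger values.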
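
One way to make scaling hard to forget is to bundle the scaler and the classifier together. The sketch below is not part of the benchmark above; it assumes scikit-learn's make_pipeline helper and reuses the same dataset and SVC parameters. The random_state=0 value is an arbitrary choice added only to make the split repeatable. Because the scaler sits inside the pipeline, it is fit on the training split only and then applied unchanged to the test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# random_state is an arbitrary illustrative value for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and classification are fit together, so the scaler only ever
# learns its min/max from the training data.
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=1.0, gamma=0.1))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy: {0:.3f}".format(accuracy))
```

Swapping MinMaxScaler for StandardScaler is a one-word change if zero mean and unit variance suit the data better.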

A similar question about "python - Speed of SVM kernels? Linear vs. RBF vs. Poly" can be found on Stack Overflow: https://stackoverflow.com/questions/43529388/
