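I am getting an error message stating "Array contains NaN or infinity". I have checked my data, both train and test, for missing values, and nothing is missing.
I may be misinterpreting what "Array contains NaN or infinity" means.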
import numpy as np
from sklearn import linear_model
from numpy import genfromtxt, savetxt

def main():
    #create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('C:\\Users\\Owner\\training.csv','r'), delimiter=',')[0:50]
    target = [x[0] for x in dataset]
    train = [x[1:50] for x in dataset]
    test = genfromtxt(open('C:\\Users\\Owner\\test.csv','r'), delimiter=',')[0:50]
    #create and train the SGD
    sgd = linear_model.SGDClassifier()
    sgd.fit(train, target)
    predictions = [x[1] for x in sgd.predict(test)]
    savetxt('C:\\Users\\Owner\\Desktop\\preds.csv', predictions, delimiter=',', fmt='%f')

if __name__=="__main__":
    main()
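I thought the data types might be throwing the algorithm for a loop (they are floats).
I understand that SGD can handle floats, so I'm not sure whether this setup requires me to declare the data type.
Such as one of the following: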
>>> dt = np.dtype('i4') # 32-bit signed integer
>>> dt = np.dtype('f8') # 64-bit floating-point number
>>> dt = np.dtype('c16') # 128-bit complex floating-point number
>>> dt = np.dtype('a25') # 25-character string
Here is the full error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-62-af5537e7802b> in <module>()
19
20 if __name__=="__main__":
---> 21 main()
<ipython-input-62-af5537e7802b> in main()
13 #create and train the SGD
14 sgd = linear_model.SGDClassifier()
---> 15 sgd.fit(train, target)
16 predictions = [x[1] for x in sgd.predict(test)]
17
C:\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in fit(self, X, y, coef_init, intercept_init, class_weight, sample_weight)
518 coef_init=coef_init, intercept_init=intercept_init,
519 class_weight=class_weight,
--> 520 sample_weight=sample_weight)
521
522
C:\Anaconda\lib\site-packages\sklearn\linear_model\stochastic_gradient.pyc in _fit(self, X, y, alpha, C, loss, learning_rate, coef_init, intercept_init, class_weight, sample_weight)
397 self.class_weight = class_weight
398
--> 399 X = atleast2d_or_csr(X, dtype=np.float64, order="C")
400 n_samples, n_features = X.shape
401
C:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in atleast2d_or_csr(X, dtype, order, copy)
114 """
115 return _atleast2d_or_sparse(X, dtype, order, copy, sparse.csr_matrix,
--> 116 "tocsr")
117
118
C:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _atleast2d_or_sparse(X, dtype, order, copy, sparse_class, convmethod)
94 _assert_all_finite(X.data)
95 else:
---> 96 X = array2d(X, dtype=dtype, order=order, copy=copy)
97 _assert_all_finite(X)
98 return X
C:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
79 'is required. Use X.toarray() to convert to dense.')
80 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
---> 81 _assert_all_finite(X_2d)
82 if X is X_2d and copy:
83 X_2d = safe_copy(X_2d)
C:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
16 if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
17 and not np.isfinite(X).all()):
---> 18 raise ValueError("Array contains NaN or infinity.")
19
20
ValueError: Array contains NaN or infinity.
Any thoughts would be greatly appreciated.
Best answer
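As the error reports, you have np.nan, np.inf, or -np.inf somewhere in your data. Since you are reading from a text file and you say your data contains no missing values, this is probably caused by the column headers or by some other value in the file that cannot be interpreted automatically.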
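The documentation for genfromtxt shows that the default dtype for reading into the array is float, which means every value you read has to pass the equivalent of float(x). A value that does not, such as header text, ends up as nan in the array.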
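To illustrate the point about the default float dtype, here is a minimal sketch that reads a small in-memory CSV instead of the files from the question. The column names and values are invented for illustration, and it assumes the header row is the culprit, which the question does not confirm.

```python
import io
import numpy as np

# Hypothetical CSV contents standing in for training.csv (names and values are invented).
csv_text = "label,f1,f2\n1,0.5,2.0\n0,1.5,3.0\n"

# With the default float dtype, the header text cannot be parsed and becomes nan.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(np.isnan(with_header[0]).all())

# skip_header=1 drops the first line before parsing, leaving only the numeric rows.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(np.isfinite(without_header).all())
```

If the header is indeed the problem, passing skip_header=1 to genfromtxt (or slicing the result with [1:], as the comment in the question intended) should be enough to get rid of the nan row.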
If you are not sure whether this is what causes the error, you can remove the non-finite numbers from the numpy array as follows:
dataset[ ~np.isfinite(dataset) ] = 0 # Set non-finite (nan, inf, -inf) to zero
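Overwriting values in place is handy as a quick diagnostic, but it also hides where the bad values were. As a hedged sketch, the same check can be done on a copy, or the affected rows can be dropped instead of zeroed; the sample array and the names cleaned and finite_rows are made up for illustration.

```python
import numpy as np

# Illustrative data containing one nan and one inf.
dataset = np.array([[0.0, 1.0, 1.0], [np.nan, 0.0, 0.0], [1.0, 2.0, np.inf]])

# Work on a copy so the original values stay available for inspection.
cleaned = dataset.copy()
cleaned[~np.isfinite(cleaned)] = 0

# Alternative: keep only the rows in which every value is finite.
finite_rows = dataset[np.isfinite(dataset).all(axis=1)]
```

Which option is appropriate depends on the data: zeroing changes feature values, while dropping rows shrinks the training set.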
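If this eliminates the error, you can be sure that there are invalid values somewhere in your data. To find out where, you can use the following: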
np.where(~np.isfinite(dataset))
This returns the indices at which the invalid values are located, for example:
>>> import numpy as np
>>> dataset = np.array([[0,1,1],[np.nan,0,0],[1,2,np.inf]])
>>> dataset
array([[ 0., 1., 1.],
       [ nan, 0., 0.],
       [ 1., 2., inf]])
>>> np.where(~np.isfinite(dataset))
(array([1, 2]), array([0, 2]))
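The tuple returned by np.where holds the row indices in its first array and the column indices in its second. If paired coordinates are easier to read, np.argwhere gives one (row, column) pair per hit. A small sketch on the same example array; the name bad_positions is just for illustration.

```python
import numpy as np

dataset = np.array([[0, 1, 1], [np.nan, 0, 0], [1, 2, np.inf]])

# np.argwhere returns one (row, column) pair per non-finite entry.
bad_positions = np.argwhere(~np.isfinite(dataset))

# Print each offending position together with the value found there.
for row, col in bad_positions:
    print(f"row {row}, column {col}: {dataset[row, col]}")
```

Here the non-finite entries sit at row 1, column 0 (nan) and row 2, column 2 (inf), matching the np.where output above.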
Regarding python - Error in Sci-Kit Learn SGD algorithm - "Array contains NaN or infinity", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18598988/