python-3.x - Could not convert string to float when using a machine learning algorithm (Python 3) (Anaconda)

Tags: python-3.x machine-learning scikit-learn anaconda

I am currently following a video on applying machine learning algorithms to the KDD Cup 99 dataset. When I run the code below I get the error "could not convert string to float: 'normal.'". 'normal.' is one of the labels found in the y set shown below. The y set has 23 labels, and when I tested the algorithm predicting only 3 of them (normal, smurf and neptune) it worked fine, but as soon as I try to have it predict against all the labels I get the error. Any guidance would be greatly appreciated, as I have been working on this for 2 days.

feature_cols =['duration','src_bytes','dst_bytes','land',
   'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
   'num_compromised', 'root_shell', 'su_attempted', 'num_root',
   'num_file_creations', 'num_shells', 'num_access_files',
   'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
   'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
   'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
   'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
   'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
   'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
   'dst_host_serror_rate', 'dst_host_srv_serror_rate',
   'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
   'proto__icmp', 'proto__tcp', 'proto__udp']

x = dataset[feature_cols]
y = dataset.label

y.value_counts(normalize=True)

Y labels

smurf.
neptune.
normal.
back.
satan.
ipsweep.
portsweep.
warezclient.
teardrop.
pod.
nmap.
guess_passwd.
buffer_overflow.
land.
warezmaster.
imap.
rootkit.
loadmodule.
ftp_write.
multihop.
phf.
perl.
spy.
Name: label, dtype: float64

Code and error

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))

ValueError                                Traceback (most recent call last)
<ipython-input-70-722f95b657f5> in <module>()
      1 from sklearn.tree import DecisionTreeClassifier
      2 dt = DecisionTreeClassifier()
----> 3 scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
      4 print (scores)
      5 print ("Accuracy: %2.10f" % np.mean(scores))

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1579                                               train, test, verbose, None,
   1580                                               fit_params)
-> 1581                       for train, test in cv)
   1582     return np.array(scores)[:, 0]
   1583 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1673             estimator.fit(X_train, **fit_params)
   1674         else:
-> 1675             estimator.fit(X_train, y_train, **fit_params)
   1676 
   1677     except Exception as e:

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792 

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: could not convert string to float: 'normal.'
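The last frame of the traceback is the key: `check_array` casts the entire feature matrix to float via `np.array(array, dtype=dtype, ...)`, so any string column left in `x` (here, the `label` column included in `feature_cols`) triggers this exact error. A minimal sketch of the failure, using toy data rather than the real KDD file:

```python
import numpy as np
import pandas as pd

# Toy feature matrix that, like the question's x, still contains the
# string 'label' column alongside numeric features.
X = pd.DataFrame({
    "duration": [0, 0],
    "src_bytes": [181, 239],
    "label": ["normal.", "smurf."],  # string column left in the features
})

# scikit-learn's check_array does the equivalent of this cast internally:
try:
    np.asarray(X, dtype=np.float64)
except ValueError as e:
    print("raised:", e)  # could not convert string to float: 'normal.'
```

Note that the string values in `y` are not the problem: scikit-learn classifiers accept string class labels for the target. Only `X` must be numeric.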

Full code, as requested

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
dataset = pd.read_csv('../data/kddcup.data', header=None, names=col_names)

# Warning, takes a while to load

# make dummy variables for protocol type

protocol_dummies = pd.get_dummies(dataset['protocol_type'], prefix='proto_')

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
dataset = pd.concat([dataset, protocol_dummies], axis=1)

del dataset['protocol_type']

x = dataset.drop(['label'], axis=1)
y = dataset.label

from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.cross_validation import train_test_split
from datetime import datetime

feature_cols =['duration','src_bytes','dst_bytes','land',
       'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
       'num_compromised', 'root_shell', 'su_attempted', 'num_root',
       'num_file_creations', 'num_shells', 'num_access_files',
       'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
       'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
       'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
       'dst_host_serror_rate', 'dst_host_srv_serror_rate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label',
       'proto__icmp', 'proto__tcp', 'proto__udp']

x = dataset[feature_cols]
y = dataset.label

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring='accuracy', cv=10)
print (scores)
print ("Accuracy: %2.10f" % np.mean(scores))

One row from the KDD dataset

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

Accepted answer

I just realized I had left the label column in my x features. I have taken it out and it works now.
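The fix above amounts to removing `'label'` from `feature_cols` (or simply reusing the earlier `dataset.drop(['label'], axis=1)` instead of overwriting `x`). A minimal sketch with a toy stand-in for the KDD data, using `sklearn.model_selection` since `sklearn.cross_validation` is removed in modern scikit-learn:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the KDD dataset (the real one is read from kddcup.data).
dataset = pd.DataFrame({
    "duration":  [0, 1, 0, 2, 1, 0],
    "src_bytes": [181, 239, 235, 219, 217, 212],
    "label": ["normal.", "smurf.", "normal.", "neptune.", "smurf.", "neptune."],
})

x = dataset.drop(["label"], axis=1)  # keep the string 'label' column out of the features
y = dataset["label"]                 # string class labels are fine as the target

dt = DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring="accuracy", cv=2)
print(scores)
```

With the `label` column excluded, every column of `x` is numeric and the cast inside `check_array` succeeds.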

Regarding python-3.x - could not convert string to float when using a machine learning algorithm (Python 3) (Anaconda), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50023892/
