python - sklearn : how to correlate test data to original data?

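Tags: python scikit-learn

I am using sklearn's train_test_split function to split my data into training and test sets. After splitting the data and running the classifier, I need to be able to trace the feature and label values back to the original data records. How can I do that? Is there a way to include some kind of hidden id feature that the classifier ignores?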

import json
import numpy as np
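# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split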

json_data = r"""
[
    { "id": 101, "label": 1, "f1": 1, "f2":2, "f3": 3 },
    { "id": 653, "label": 0, "f1": 2, "f2":7, "f3": 8 },
    { "id": 219, "label": 0, "f1": 4, "f2":9, "f3": 2 },
    { "id": 726, "label": 1, "f1": 6, "f2":1, "f3": 0 },
    { "id": 403, "label": 0, "f1": 1, "f2":5, "f3": 4 }
]"""

data = json.loads(json_data)

feature_names = ["f1", "f2", "f3"]

labels = []
features = []
for item in data:
    temp_list = []
    labels.append(item["label"])
    for feature_name in feature_names:
        temp_list.append(item[feature_name])
    features.append(temp_list)

labels_train, labels_test, features_train, features_test = train_test_split(labels, features, test_size = .20, random_state = 99)

print(labels_test)
print(features_test)

## this will give us labels_test = [0], features_test = [[4,9,2]] which corresponds to record with id = 219
## how can I efficiently correlate the split data back to the original records without comparing feature values?

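Best answer

I usually store the input data in a Pandas DataFrame and use its index to do the train/test split; for your example, you could use something like this: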

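import io
import numpy as np
import pandas as pd

test_size = 0.2

# json_data and feature_names are the ones defined in the question's code
df = pd.read_json(io.StringIO(json_data))
# Boolean mask: True marks a training row; each row falls into the test set with probability test_size
I = np.random.random(len(df)) > test_size

X_train = df[I][feature_names].values
X_test = df[~I][feature_names].values
y_train = df[I]['label'].values
y_test = df[~I]['label'].values

# df[~I] keeps the original index and the 'id' column, so each test row maps back to its record
ids_test = df[~I]['id'].values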
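
A random mask like the one above gives a test set whose size varies from run to run, and with only five rows it can even come out empty. If an exact split size matters, one possible variation (a sketch, not part of the original answer) is to pass the DataFrame itself to train_test_split, which keeps each row's original index label on both halves. The names df_train, df_test and test_ids are illustrative, the sketch assumes scikit-learn 0.18 or later, and it repeats the question's json_data only so that it runs on its own.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Same records as in the question, repeated so this sketch is self-contained.
json_data = """[
    {"id": 101, "label": 1, "f1": 1, "f2": 2, "f3": 3},
    {"id": 653, "label": 0, "f1": 2, "f2": 7, "f3": 8},
    {"id": 219, "label": 0, "f1": 4, "f2": 9, "f3": 2},
    {"id": 726, "label": 1, "f1": 6, "f2": 1, "f3": 0},
    {"id": 403, "label": 0, "f1": 1, "f2": 5, "f3": 4}
]"""
feature_names = ["f1", "f2", "f3"]

df = pd.read_json(io.StringIO(json_data))

# Splitting the DataFrame (rather than plain lists) preserves the index labels.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=99)

# Only the feature columns are handed to the classifier; "id" stays behind in df.
X_train, y_train = df_train[feature_names].values, df_train["label"].values
X_test, y_test = df_test[feature_names].values, df_test["label"].values

# Row i of X_test corresponds to index label df_test.index[i] in the original frame.
test_ids = df.loc[df_test.index, "id"].tolist()
print(test_ids)
```

Because the classifier only ever sees the feature_names columns, the id column acts as the "hidden" identifier the question asks about: it is never used for fitting, yet every prediction on X_test can be matched to a record through df_test.index.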

Regarding python - sklearn : how to correlate test data to original data?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31931896/
