I use the following code to split my dataset into training/validation/test sets.
from sklearn.model_selection import train_test_split

# First split: 70% training, 30% held out
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42)
# Second split: divide the held-out 30% evenly into test and validation sets
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
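The problem is that my dataset is really imbalanced. For example, some classes have 500 samples while others have only 70. Is this way of splitting reliable in that case? Is the sampling purely random, or does sklearn use some method to keep the class distribution the same across all the sets?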
Best answer
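You should use the stratify option (see the docs). Without it, train_test_split shuffles and samples at random, so the class proportions are not guaranteed to match across the sets: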
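# stratify=y_data keeps the class proportions of y_data in both parts
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42, stratify=y_data)
# Stratify again on the held-out labels when splitting them into test and validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)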
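To check that stratification does what you expect, you can compare the class fractions in each subset with those of the full dataset. The snippet below is only a minimal sketch: the synthetic labels (500 samples of one class, 70 of another, mirroring the numbers in the question), the placeholder features, the helper class_fractions, and the intermediate names X_rest / y_rest are all assumptions made for illustration and are not part of the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumed for illustration): 500 of class 0, 70 of class 1.
y_data = np.array([0] * 500 + [1] * 70)
X_data = np.arange(len(y_data)).reshape(-1, 1)  # placeholder features


def class_fractions(y):
    """Return a dict mapping each class label to its fraction of the samples in y."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).round(3).tolist()))


# 70% train, 30% held out, preserving the class proportions of y_data.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_data, y_data, test_size=0.3, random_state=42, stratify=y_data
)
# Split the held-out part in half, again stratified on its own labels.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

# Print the size and class fractions of each subset for comparison.
for name, y in [("full", y_data), ("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(y), class_fractions(y))
```

With stratify set, the printed fractions for train, val, and test should be close to those of the full dataset; small differences remain because each class count has to be a whole number.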
Source: a similar question on Stack Overflow about splitting an imbalanced dataset with sklearn.model_selection in Python: https://stackoverflow.com/questions/56024117/