pandas - Splitting data into train and test sets by condition

Tags: pandas machine-learning scikit-learn

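Suppose I have a pandas DataFrame with loan information, and I want to predict the probability that a user does not pay the money back (indicated by the default column in my data frame). I want to split the data into a training set and a test set using sklearn.model_selection.train_test_split.

However, I want to make sure that loans with the same customerID never appear in both the test set and the training set. How can I do that?

Here is a sample of my data: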

import pandas as pd

d = {'loan_date': ['20170101','20170701','20170301','20170415','20170515'],
     'customerID': [111,111,222,333,444],
     'loanID': ['aaa','fff','ccc','ddd','bbb'],
     'loan_duration' : [6,3,12,5,12],
     'gender':['F','F','M','F','M'],
     'loan_amount': [20000,10000,30000,10000,40000],
     'default':[0,1,0,0,1]}

df = pd.DataFrame(data=d)
For example,

the loan records with customerID == 111 should appear either in the test set or in the training set, but not in both.

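Best answer

I propose the following solution. Loans with the same customerID never appear in both train and test; in addition, customers are split by their number of occurrences, i.e. customers with the same number of loans are distributed across both train and test.

I extended the data sample for demonstration purposes: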

d = {'loan_date': ['20170101','20170701','20170301','20170415','20170515','20170905','20170814','20170819','20170304'],
     'customerID': [111,111,222,333,444,222,111,444,555],
     'loanID': ['aaa','fff','ccc','ddd','bbb','eee','kkk','zzz','yyy'],
     'loan_duration': [6,3,12,5,12,3,17,4,6],
     'gender': ['F','F','M','F','M','M','F','M','F'],
     'loan_amount': [20000,10000,30000,10000,40000,20000,30000,30000,40000],
     'default': [0,1,0,0,1,0,1,1,0]}

df = pd.DataFrame(data=d)

Code:

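import pandas as pd
from sklearn.model_selection import train_test_split

def group_customers_by_activity(df):
    # Number of loans per customer, as a Series indexed by customerID.
    # Working on the Series directly avoids relying on the column names that
    # value_counts().reset_index() produces, which differ between pandas versions.
    loan_counts = df['customerID'].value_counts()
    df_by_customer = df.set_index('customerID')
    # One DataFrame per distinct loan count
    df_s = [df_by_customer.loc[loan_counts[loan_counts == count].index]
            for count in loan_counts.unique()]
    return df_s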

- This function splits df by customerID occurrence count (the number of rows that share the same customerID).
Example output of the function:

group_customers_by_activity(df)
Out:
[           loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 111         20170101    aaa              6      F        20000        0
 111         20170701    fff              3      F        10000        1
 111         20170814    kkk             17      F        30000        1,
            loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 222         20170301    ccc             12      M        30000        0
 222         20170905    eee              3      M        20000        0
 444         20170515    bbb             12      M        40000        1
 444         20170819    zzz              4      M        30000        1,
            loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 333         20170415    ddd              5      F        10000        0
 555         20170304    yyy              6      F        40000        0]

- i.e. groups of customers that have 1, 2, 3, etc. loans (listed here from most loans to fewest).
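The same grouping idea can also be sketched by mapping each row to its customer's loan count and grouping on that. This is only an illustration: the helper name group_by_loan_count and the reduced two-column frame are assumptions made for the example.

```python
import pandas as pd

# Reduced illustrative frame: only the columns needed for grouping
df = pd.DataFrame({
    'customerID': [111, 111, 222, 333, 444, 222, 111, 444, 555],
    'loanID': ['aaa', 'fff', 'ccc', 'ddd', 'bbb', 'eee', 'kkk', 'zzz', 'yyy'],
})

def group_by_loan_count(df):
    """Return {loan_count: sub-DataFrame} (hypothetical helper)."""
    # For every row, look up how many rows share its customerID
    loan_count = df['customerID'].map(df['customerID'].value_counts())
    return {int(count): group for count, group in df.groupby(loan_count)}

groups = group_by_loan_count(df)
for count in sorted(groups):
    print(count, sorted(groups[count]['customerID'].unique().tolist()))
```

Returning a dictionary keyed by loan count makes it explicit which group is which, instead of relying on list positions.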

This function splits a group so that each customer goes to either train or test:

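def split_group(df_group, train_size=0.8):
    # Split the unique customer IDs rather than the rows,
    # so that all loans of one customer stay together
    customers = df_group.index.unique()
    train_customers, test_customers = train_test_split(customers, train_size=train_size)
    # Select every loan that belongs to the chosen customers
    train_df, test_df = df_group.loc[train_customers], df_group.loc[test_customers]
    return train_df, test_df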

df_s = group_customers_by_activity(df)
split_group(df_s[1])
Out:
(           loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 444         20170515    bbb             12      M        40000        1
 444         20170819    zzz              4      M        30000        1,
            loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 222         20170301    ccc             12      M        30000        0
 222         20170905    eee              3      M        20000        0)

All that remains is to apply it to every "customer occurrence" group:

def get_sized_splits(df_s, train_size):
    train_splits, test_splits = zip(*[split_group(df_group, train_size) for df_group in df_s])
    return train_splits, test_splits

df_s = group_customers_by_activity(df)
train_splits, test_splits = get_sized_splits(df_s, 0.8)
train_splits, test_splits
Out:
((Empty DataFrame
  Columns: [loan_date, loanID, loan_duration, gender, loan_amount, default]
  Index: [],
             loan_date loanID  loan_duration gender  loan_amount  default
  customerID                                                             
  444         20170515    bbb             12      M        40000        1
  444         20170819    zzz              4      M        30000        1,
             loan_date loanID  loan_duration gender  loan_amount  default
  customerID                                                             
  333         20170415    ddd              5      F        10000        0),
 (           loan_date loanID  loan_duration gender  loan_amount  default
  customerID                                                             
  111         20170101    aaa              6      F        20000        0
  111         20170701    fff              3      F        10000        1
  111         20170814    kkk             17      F        30000        1,
             loan_date loanID  loan_duration gender  loan_amount  default
  customerID                                                             
  222         20170301    ccc             12      M        30000        0
  222         20170905    eee              3      M        20000        0,
             loan_date loanID  loan_duration gender  loan_amount  default
  customerID                                                             
  555         20170304    yyy              6      F        40000        0))

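Don't be afraid of the empty DataFrame (it appears because the three-loan group holds a single customer, so its training share rounds down to zero); it disappears once the pieces are concatenated. The split function is defined as follows: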

def split(df, train_size):
    df_s = group_customers_by_activity(df)
    train_splits, test_splits = get_sized_splits(df_s, train_size=train_size)
    return pd.concat(train_splits), pd.concat(test_splits)

split(df, 0.8)
Out:
(           loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 444         20170515    bbb             12      M        40000        1
 444         20170819    zzz              4      M        30000        1
 555         20170304    yyy              6      F        40000        0,
            loan_date loanID  loan_duration gender  loan_amount  default
 customerID                                                             
 111         20170101    aaa              6      F        20000        0
 111         20170701    fff              3      F        10000        1
 111         20170814    kkk             17      F        30000        1
 222         20170301    ccc             12      M        30000        0
 222         20170905    eee              3      M        20000        0
 333         20170415    ddd              5      F        10000        0)
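Since the goal is that no customer ends up on both sides, that property can be checked directly on any pair of results. The helper below is a hypothetical sketch; it assumes both frames are indexed by customerID, as the frames in the output above are, and it rebuilds them with a single column for brevity.

```python
import pandas as pd

def customers_in_both(train_df, test_df):
    """Return the customerIDs present in both frames (hypothetical helper)."""
    return train_df.index.unique().intersection(test_df.index.unique())

# Frames shaped like the output above, reduced to one column
train = pd.DataFrame(
    {'loanID': ['bbb', 'zzz', 'yyy']},
    index=pd.Index([444, 444, 555], name='customerID'))
test = pd.DataFrame(
    {'loanID': ['aaa', 'fff', 'kkk', 'ccc', 'eee', 'ddd']},
    index=pd.Index([111, 111, 111, 222, 222, 333], name='customerID'))

overlap = customers_in_both(train, test)
assert overlap.empty, f"customers on both sides: {list(overlap)}"
```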

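- So each customerID is placed in either the training data or the test data. I guess the odd proportions (the training set comes out smaller than the test set despite train_size=0.8) are due to the tiny input: each group holds only one or two customers, so the training share rounds down.
If you don't need the grouping by "customerID activity", you can omit it and use split_group alone to achieve the goal.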
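As an alternative to hand-written splitting, scikit-learn ships group-aware splitters. The sketch below uses GroupShuffleSplit, which is a different technique from the functions above: it keeps every customerID on one side of the split, but it does not balance customers by their number of loans. The test_size and random_state values are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    'customerID': [111, 111, 222, 333, 444, 222, 111, 444, 555],
    'loan_amount': [20000, 10000, 30000, 10000, 40000, 20000, 30000, 30000, 40000],
    'default': [0, 1, 0, 0, 1, 0, 1, 1, 0],
})

# test_size is the fraction of groups (customers), not of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df['customerID']))

train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
# Every customer lands on exactly one side
assert set(train_df['customerID']).isdisjoint(test_df['customerID'])
```

Because the split is made over customers rather than rows, the row proportions of the two sets can differ noticeably from test_size when customers have very different numbers of loans.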

Source: a similar question on Stack Overflow about pandas - splitting data into train and test sets by condition: https://stackoverflow.com/questions/54389035/
