machine-learning - How to split a dataset into training and test sets using the hash code method

Tags: machine-learning train-test-split

I am following the code from Hands-On Machine Learning with Scikit-Learn and TensorFlow (2nd edition). In the section on creating training and test sets, the authors split the data using the following procedure:

from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# `housing` is the housing DataFrame loaded earlier in the book
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

According to the author:

You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.

So I would like to understand what this line of code does: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

Any help is greatly appreciated!

Best Answer

This may be a bit late, but if you are still looking for an answer, here is the documentation for the crc32 function:

Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff.

So essentially, the `& 0xffffffff` mask is only there to guarantee the same numeric result no matter whether the function is run under Python 2 or Python 3 (in Python 3, crc32 already returns an unsigned value, so the mask changes nothing). The rest of the line does the actual splitting: crc32 produces a 32-bit hash in the range [0, 2**32), so the comparison `< test_ratio * 2**32` puts an instance in the test set exactly when its hash falls in the bottom test_ratio fraction of that range.
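A minimal sketch of both points, assuming Python 3: the mask leaves crc32's result unchanged, and the comparison selects roughly a test_ratio fraction of the ids:

```python
from zlib import crc32
import numpy as np

# On Python 3, crc32 already returns an unsigned 32-bit value,
# so masking with 0xffffffff is a no-op.
h = crc32(np.int64(42))
assert h & 0xffffffff == h

# The hash is roughly uniform over [0, 2**32), so the comparison
# `hash < test_ratio * 2**32` keeps about test_ratio of all ids.
test_ratio = 0.2
n = 100_000
in_test = sum(crc32(np.int64(i)) & 0xffffffff < test_ratio * 2**32
              for i in range(n))
print(in_test / n)  # close to 0.2
```

The fraction is only approximately 0.2 because the hash values are pseudo-random, not an exact partition.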

Regarding "machine-learning - How to split a dataset into training and test sets using the hash code method", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58811081/
