I'm following the code from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition). In the section on creating the training and test sets, the book builds them as follows:
from zlib import crc32

import numpy as np

def test_set_check(identifier, test_ratio):
    # True when the identifier's 32-bit hash falls in the lowest test_ratio of the hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
According to the author:
You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
What I'd like to understand is what this line of code does: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
Any help is much appreciated!
Best answer
This may be a bit late, but if you're still looking for an answer, here is what the documentation for the crc32 function says:
Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff.
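So, essentially, the & 0xffffffff mask is only there to make sure the function gives the same result no matter whether the person running it is on Python 2 or Python 3.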
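For the rest of the expression, it may help to split the one-liner into named steps. This is only a sketch: the identifier value 42 and the intermediate variable names are assumptions for illustration and do not appear in the book.

```python
from zlib import crc32

import numpy as np

identifier = 42      # hypothetical row index
test_ratio = 0.2

# CRC-32 checksum of the 8 bytes that make up the int64 value.
raw = crc32(np.int64(identifier))

# Mask to 32 bits so the value is unsigned, in [0, 2**32), on every Python version.
hash_value = raw & 0xffffffff

# test_ratio of the full hash range: 20% of 2**32 here.
threshold = test_ratio * 2**32

# The row goes to the test set when its hash lands below the threshold.
in_test_set = hash_value < threshold

# & binds more tightly than <, so the original expression parses as
# (crc32(...) & 0xffffffff) < (test_ratio * 2**32).
one_liner = crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
print(hash_value, threshold, in_test_set)
```

Since CRC-32 values are spread roughly evenly over the 32-bit range, about test_ratio of all identifiers should fall below the threshold, which is where the approximate 20% test share comes from.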
Source: the Stack Overflow question (tagged machine-learning) on how to split a dataset into training and test sets using the hash method: https://stackoverflow.com/questions/58811081/