R: Splitting a dgCMatrix into training and test matrices for XGBoost training

Tags: r machine-learning categorical-data xgboost

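First of all, I am new to XGBoost, so please forgive my ignorance.

The question is:

How do I split a dgCMatrix into two matrices (for example, training and test)? My goal is to use these matrices for XGBoost training. I got the dgCMatrix when I used one-hot encoding to convert all categorical variables to numeric ones. Is it OK to one-hot encode the training dataset and the test dataset separately?

I tried one-hot encoding with dummyVars (from the caret package), but my R session aborted for a reason I could not identify.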

Best answer

Adding DexGroves's comment here as an answer, since it answers the question.

Even if you split your dataset into two parts (say, A and B), a factor column keeps the information about all of its levels in both A and B, including levels that do not occur in one of the parts. So when you one-hot encode a subset, every level is encoded whether or not it appears in that subset, and the next subset gets the same encoding.
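To illustrate the point, here is a minimal sketch, not taken from the original answer. It assumes the categorical columns are stored as factors and uses `sparse.model.matrix()` from the Matrix package for the one-hot encoding; the toy data frame and its column names (`colour`, `size`, `label`) are hypothetical. It shows both routes: encoding once and splitting the resulting sparse matrix by row index, and encoding a subset on its own.

```r
library(Matrix)  # sparse.model.matrix() and the dgCMatrix class

set.seed(42)

# Hypothetical toy data: one factor, one numeric feature, a binary label
df <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), 20, replace = TRUE),
                  levels = c("red", "green", "blue")),
  size   = runif(20),
  label  = rbinom(20, 1, 0.5)
)

# Route 1: one-hot encode once, then split the sparse matrix by row index.
# "- 1" drops the intercept so every level of the factor gets its own column.
X <- sparse.model.matrix(label ~ . - 1, data = df)
train_idx <- sample(nrow(X), size = 0.8 * nrow(X))

X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- df$label[train_idx]
y_test  <- df$label[-train_idx]

# Route 2: subset the data frame first, then encode the subset separately.
# "blue" never occurs in df_a, but the factor still carries that level.
df_a <- df[df$colour != "blue", ]
X_a  <- sparse.model.matrix(label ~ . - 1, data = df_a)

# The subset is encoded with the same columns as the full data;
# the column for the missing level is simply all zeros.
same_columns <- identical(colnames(X_a), colnames(X))
```

Row indexing with `[` keeps the result sparse, so `X_train` and `y_train` could then be passed to something like `xgboost::xgb.DMatrix(data = X_train, label = y_train)`. Note that the identical-columns behaviour relies on the columns being factors before the split; if they are character vectors that get converted to factors after splitting, each part may end up with a different set of levels.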

Original question on Stack Overflow: https://stackoverflow.com/questions/39330087/
