python - Do machine learning algorithms in scikit-learn require converting a pandas DataFrame to a numpy array?

Closed 3 years ago. This question needs to be more focused and is not currently accepting answers.

Do machine learning algorithms in scikit-learn require a pandas DataFrame to be converted to a numpy array?

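I know that the to_numpy() function does the conversion. That means I also have to manually create a dummy matrix for the categorical columns in the pandas DataFrame.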

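What happens if I just use the pandas DataFrame as input to scikit-learn? If I convert the DataFrame to a numpy array, does that mean my column names are no longer retained by the machine learning algorithm? When it comes to model diagnostics, are extra steps needed to reconcile the column names with the numpy array?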

Best answer

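Supplying a float array is a safe choice, but it is not required. Whatever you supply, scikit-learn will try to convert it to a numpy array internally. If the input is not array-like (see below), an exception is raised.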
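As a minimal sketch of that behaviour, the snippet below fits a RandomForestRegressor directly on a DataFrame. The toy data and column names (`rooms`, `area`, `city`) are invented for illustration, and the `feature_names_in_` attribute is assumed to be available only on scikit-learn 1.0 or later. Dummy encoding with `pd.get_dummies` is shown as one possible way to handle a text column, not the only one.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data; the column names and values are illustrative only.
df = pd.DataFrame({
    "rooms": [2, 3, 4, 3, 5, 2],
    "area": [50.0, 70.0, 95.0, 65.0, 120.0, 45.0],
})
y = [100.0, 150.0, 210.0, 140.0, 260.0, 90.0]

# An all-numeric DataFrame can be passed as-is; no to_numpy() call is needed.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(df, y)
predictions = model.predict(df)

# Assumption: scikit-learn >= 1.0 records the column names seen during fit.
# getattr keeps the sketch working on older releases, where this is None.
feature_names = getattr(model, "feature_names_in_", None)
print("column names seen by the model:", feature_names)

# A text column cannot be converted to float, so fit() raises an exception.
df_text = df.assign(city=["a", "b", "a", "c", "b", "a"])
try:
    RandomForestRegressor(n_estimators=10, random_state=0).fit(df_text, y)
    fit_failed = False
except (ValueError, TypeError) as exc:
    fit_failed = True
    print("fit rejected the text column:", type(exc).__name__)

# One option: dummy-encode the text column so that every column is numeric.
df_encoded = pd.get_dummies(df_text, columns=["city"])
encoded_model = RandomForestRegressor(n_estimators=10, random_state=0)
encoded_model.fit(df_encoded, y)
```

The encoded frame has five columns (the two numeric ones plus one indicator per city value), and the model accepts it without any manual conversion to a numpy array.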

Take RandomForestRegressor as an example: scikit-learn has a concept of "array-like" input. See, for instance, the docstring of RandomForestRegressor.fit():

X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.

You can learn more about what counts as array-like by reading the glossary:

array-like
The most common data format for input to Scikit-learn estimators and functions, array-like is any type object for which numpy.asarray will produce an array of appropriate shape (usually 1 or 2-dimensional) of appropriate dtype (usually numeric).

This includes:

a numpy array

a list of numbers

a list of length-k lists of numbers for some fixed length k

a pandas.DataFrame with all columns numeric

a numeric pandas.Series

It excludes:

a sparse matrix

an iterator

a generator
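The glossary definition hinges on what numpy.asarray produces, so it can be probed directly. The sketch below is only an illustration with made-up inputs: it runs numpy.asarray on a few of the listed types and on a generator, and records the resulting shape and dtype kind.

```python
import numpy as np
import pandas as pd

# Illustrative inputs only; the names are labels for this sketch.
candidates = {
    "list of numbers": [1, 2, 3],
    "list of lists": [[1, 2], [3, 4]],
    "numeric DataFrame": pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}),
    "numeric Series": pd.Series([1.5, 2.5]),
    "generator": (i for i in range(3)),
}

shapes = {}
for name, obj in candidates.items():
    arr = np.asarray(obj)
    # dtype.kind is 'i' for integers, 'f' for floats, 'O' for Python objects.
    shapes[name] = (arr.shape, arr.dtype.kind)
    print(f"{name}: shape={arr.shape}, dtype kind={arr.dtype.kind}")
```

The generator is not consumed: numpy wraps it in a zero-dimensional object array, which has neither an appropriate shape nor a numeric dtype. That is why it falls under the excluded types.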

If you browse the source, you will find that the data you pass to the method flows through self._validate_data, which performs the conversion for you.

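You can always check ahead of time whether your data is acceptable with sklearn.utils.check_array, but this has little practical value, because the same check is done for you anyway when you pass the data to the method.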
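For completeness, a short sketch of such a pre-check follows. The DataFrame contents are invented for illustration, and `dtype=np.float32` is passed only to mirror the conversion mentioned in the fit() docstring; it is not necessarily the exact call the estimator makes internally.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical numeric data for the sketch.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# check_array returns a validated numpy array; the DataFrame is left unchanged.
X = check_array(df, dtype=np.float32)
print(type(X).__name__, X.shape, X.dtype)

# A text column makes the frame non-numeric, so validation raises.
try:
    check_array(df.assign(c=["x", "y", "z"]))
    accepted = True
except (ValueError, TypeError) as exc:
    accepted = False
    print("rejected:", type(exc).__name__)
```

Note that the returned array carries no column names, so if they are needed for diagnostics they have to be kept from the original DataFrame (for example via `df.columns`).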

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/65273553/
