python - pandas.cut vs df.describe()

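I want to group my data into 4 ranges, so I binned it with pandas.cut. This is my code and the result.

Then I ran df.describe() and found that its range edges differ from the ones pd.cut produced. Why?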

pd.cut gives [(2.719, 3.042] < (3.042, 3.365] < (3.365, 3.688] < (3.688, 4.01]]

df.describe() gives

min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000

Best answer

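Your cut splits the range between the minimum and the maximum into 4 equal-width bins, whereas describe reports quartiles, i.e. the values that split the data into 4 groups of equal size. Only for uniformly distributed data do the two lead to (approximately) the same split.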
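The edges from the question can be reproduced by hand, which makes the difference concrete. The following is a minimal sketch: it assumes the min and max shown in the describe() output, and it assumes that pd.cut, with its default right=True, lowers only the first edge by 0.1% of the range so that the minimum falls inside the first bin. That second assumption is inferred from the pandas documentation and from the edges printed in the question, not from the asker's actual code.

```python
import numpy as np

# Min and max copied from the describe() output in the question.
lo, hi = 2.72, 4.01

# 4 equal-width bins need 5 evenly spaced edges between min and max.
edges = np.linspace(lo, hi, 5)

# Assumption: the first edge is lowered by 0.1% of the range so that
# the minimum itself falls inside the first (left-open) bin.
edges[0] = lo - 0.001 * (hi - lo)

print(edges)
```

Up to display rounding, these should be the edges pd.cut reported. Note that the 25%, 50% and 75% values from describe() play no part in the calculation.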

Example:

import numpy as np
import pandas as pd

# One uniformly and one normally distributed column, a million samples each.
df = pd.DataFrame({'uniform': np.random.rand(1_000_000),
                   'normal': np.random.randn(1_000_000)})

with np.printoptions(formatter={'float': '{:.3f}'.format}):
    for col in ('uniform', 'normal'):
        # Rows 3 onward of describe() are min, 25%, 50%, 75%, max.
        quartiles = df[col].describe().iloc[3:].values
        # Edges of the 4 equal-width bins from pd.cut, as (left, right) tuples.
        bins = pd.cut(df[col], 4).dtype.categories.to_tuples().to_list()
        print(f'{col}:\n   {quartiles}\n   {bins}')

Output (exact values vary between runs, since the data is random):

uniform:
   [0.000 0.250 0.499 0.750 1.000]
   [(-0.001, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
normal:
   [-4.908 -0.675 0.001 0.674 5.082]
   [(-4.918, -2.411), (-2.411, 0.0867), (0.0867, 2.584), (2.584, 5.082)]
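If the goal is bins whose edges agree with describe(), pd.qcut is the quantile-based counterpart of pd.cut. The sketch below is not part of the original answer. The seed, the sample size and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary fixed seed, for reproducibility only
s = pd.Series(rng.standard_normal(100_000))

# Equal-width bins: edges evenly spaced between min and max.
by_width = pd.cut(s, 4)

# Quantile-based bins: edges placed at the quartiles.
# retbins=True also returns the unrounded edges.
by_quantile, q_edges = pd.qcut(s, 4, retbins=True)

# The same five numbers as reported by describe().
quartiles = s.describe().loc[['min', '25%', '50%', '75%', 'max']].to_numpy()

# Number of observations per bin, in bin order.
w_counts = by_width.value_counts().sort_index()
q_counts = by_quantile.value_counts().sort_index()

print('describe():', quartiles)
print('qcut edges:', q_edges)
print(w_counts)
print(q_counts)
```

With normally distributed data, the equal-width bins from pd.cut should hold very different numbers of observations, while the pd.qcut bins should each hold about a quarter of them.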

Source: the Stack Overflow question "Pandas .cut VS df.describe()", https://stackoverflow.com/questions/72725247/
