python - 当数据存在联系时，如何计算 Pandas 中的分位数箱？

考虑以下简单示例。我有兴趣获得一个包含与分位数对应的类别的分类变量。

  df = pd.DataFrame({'A':'foo foo foo bar bar bar'.split(),
                       'B':[0, 0, 1]*2})

df
Out[67]: 
     A  B
0  foo  0
1  foo  0
2  foo  1
3  bar  0
4  bar  0
5  bar  1

在 Pandas 中，qtile 完成这项工作。不幸的是，由于数据的联系，qtile 在这里会失败。

df['C'] = df.groupby(['A'])['B'].transform(
                     lambda x: pd.qcut(x, 3, labels=range(1,4)))

给出了经典的ValueError:Bin边缘必须是唯一的:array([0.，0.，0.33333333，1.])

是否有另一个强大的解决方案(来自任何其他 python 包)不需要重新发明轮子？

必须是这样。我不想自己编写自己的分位数 bin 函数。任何像样的统计数据包都可以在创建分位数箱时处理关系(SAS、Stata 等)。

我想要一些基于合理方法选择且稳健的东西。

例如，在此处查找 SAS https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm 中的解决方案.

或者在这里查看 Stata 中众所周知的 xtile ( http://www.stata.com/manuals13/dpctile.pdf )。请注意这个帖子Definitive way to match Stata weighted xtile command using Python?

我错过了什么？也许使用Scipy？

非常感谢!

最佳答案

IIUC，您可以使用numpy.digitize

df['C'] = df.groupby(['A'])['B'].transform(lambda x: np.digitize(x,bins=np.array([0,1,2])))

     A  B  C
0  foo  0  1
1  foo  0  1
2  foo  1  2
3  bar  0  1
4  bar  0  1
5  bar  1  2

关于python - 当数据存在联系时，如何计算 Pandas 中的分位数箱？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/38594277/

上一篇：python - Double Group-by 然后应用一些功能？

下一篇：python - pandas 中的多个引号

相关文章：

python - 如何提高 Python 中 odeint 的速度？

python - 使用 pygtk 如何制作简单的全屏幻灯片放映

python - 同步本地和远程数据库

python - 从请求中获取字节响应

python - 使用 python 中的 ruby gem/命令

python - scipy 最小化 SLSQP - 'Singular matrix C in LSQ subproblem'

python - 使用Python进行超几何分布的差异

python - 将列表列表获取到 pandas DataFrame

python - 如何将 DataFrame 的每一行导出到同一工作簿中的不同工作表？

python - 属性错误: module 'pandas.compat' has no attribute 'iteritems'