python - Using groupby to split a data frame and merge the subsets into columns

Tags: python pandas merge group-by outer-join

I have a large pandas.DataFrame that looks like this:

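import numpy
import pandas
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
# list(...) is needed on Python 3, where range objects cannot be concatenated with +
test.index = list(range(3)) + list(range(3)) + list(range(4))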
id  score       name
0   -0.652909   A
1   0.100885    A
2   0.410907    A
0   0.304012    B
1   -0.198157   B
2   -0.054764   B
0   0.358484    C
1   0.616415    C
2   0.389018    C
3   1.164172    C

So the index is non-unique overall, but it is unique within each group of the name column. I would like to split the data frame into subsets by name, then combine the score columns (by means of an outer join) into one new data frame, renaming each score column to its group key. What I have at the moment is:

df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

This produces the expected result:

id  A           B           C
0   -0.652909   0.304012    0.358484
1   0.100885    -0.198157   0.616415
2   0.410907    -0.054764   0.389018
3   NaN         NaN         1.164172

but seems not very pandas-ic. Is there a better way?

Edit: Based on the answers I ran some simple timings.

%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key
100 loops, best of 3: 2.46 ms per loop
%%timeit
test.set_index([test.index, "name"]).unstack()
1000 loops, best of 3: 1.04 ms per loop
%%timeit
test.pivot_table("score", test.index, "name")
100 loops, best of 3: 2.54 ms per loop

So unstack appears to be the preferred method.
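Before trusting the timings, it is worth confirming that the three approaches return the same values. The following is a minimal sketch under a few assumptions: the scores are fixed (numpy.arange) rather than random so the check is reproducible, the names joined, unstacked and pivoted are only illustrative, and the loop uses Series.rename to label each column instead of writing into df.columns.values, which is a swapped-in but equivalent way to get the group key as the column name.

```python
import numpy as np
import pandas as pd

# Fixed scores instead of random ones (an assumption, for reproducibility)
test = pd.DataFrame({"score": np.arange(10, dtype=float)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))

# 1) Loop with an outer join; rename() labels each Series with its group key
joined = pd.DataFrame()
for key, sub in test.groupby("name"):
    joined = joined.join(sub["score"].rename(key), how="outer")

# 2) unstack on a (row index, name) MultiIndex, then pick the "score" block
unstacked = test.set_index([test.index, "name"]).unstack()["score"]

# 3) pivot_table with the existing row index as the pivot index
pivoted = test.pivot_table(values="score", index=test.index, columns="name")

# Compare the raw values, treating NaN as equal to NaN
same = (
    np.allclose(joined.to_numpy(), unstacked.to_numpy(), equal_nan=True)
    and np.allclose(joined.to_numpy(), pivoted.to_numpy(), equal_nan=True)
)
print(same)
```

Only the values are compared here; the three results can still differ in metadata such as the name of the column index.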

Best answer

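The function you are looking for is unstack. For pandas to know what to unstack on, we first create a MultiIndex in which the name column is added as the last index level. unstack() then unstacks on the last index level (by default), so we get exactly what you want: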

In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
In[153]: test
Out[153]: 
      score name
0 -0.208392    A
1 -0.103659    A
2  1.645287    A
0  0.119709    B
1 -0.047639    B
2 -0.479155    B
0 -0.415372    C
1 -1.390416    C
2 -0.384158    C
3 -1.328278    C
In[154]: test.set_index([test.index, 'name'], inplace=True)
test.unstack()
Out[154]: 
         score                    
name         A         B         C
0    -0.208392  0.119709 -0.415372
1    -0.103659 -0.047639 -1.390416
2     1.645287 -0.479155 -0.384158
3          NaN       NaN -1.328278
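The result above keeps "score" as an outer column level. If flat columns A, B, C are wanted, as in the table shown in the question, one option is to select the score Series before unstacking. This is a minimal sketch that assumes fixed scores (numpy.arange) for reproducibility; the name wide is only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed scores instead of random ones (an assumption, for reproducibility)
test = pd.DataFrame({"score": np.arange(10, dtype=float)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))

# Selecting the Series first means unstack() returns single-level columns
wide = test.set_index([test.index, "name"])["score"].unstack()

# Optional: drop the "name" label that unstack leaves on the column index
wide.columns.name = None
print(wide)
```

Row 3 exists only in group C, so columns A and B hold NaN there, which matches the outer-join behaviour asked for in the question.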

Regarding "python - Using groupby to split a data frame and merge the subsets into columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24759397/
