python - Split a data frame with groupby and merge the subsets into columns

I have a large pandas.DataFrame that looks like this:

import numpy
import pandas

test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))  # Python 3: range objects cannot be added with +

id  score       name
0   -0.652909   A
1   0.100885    A
2   0.410907    A
0   0.304012    B
1   -0.198157   B
2   -0.054764   B
0   0.358484    C
1   0.616415    C
2   0.389018    C
3   1.164172    C

So the index is non-unique but is unique if I group by the column name. I would like to split the data frame into subsections by name and then assemble (by means of an outer join) the score columns into one big new data frame and change the column names of the scores to the respective group key. What I have at the moment is:

df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

This produces the expected result:

id  A           B           C
0   -0.652909   0.304012    0.358484
1   0.100885    -0.198157   0.616415
2   0.410907    -0.054764   0.389018
3   NaN         NaN         1.164172

but seems not very pandas-ic. Is there a better way?

Edit: Based on the answers I ran some simple timings.

%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

100 loops, best of 3: 2.46 ms per loop

%%timeit
test.set_index([test.index, "name"]).unstack()

1000 loops, best of 3: 1.04 ms per loop

%%timeit
test.pivot_table("score", test.index, "name")

100 loops, best of 3: 2.54 ms per loop

So unstack appears to be the preferred method.
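Before comparing speed, it may be worth confirming that the two vectorized approaches really build the same table. The following is a minimal sketch under a few assumptions: Python 3 and a recent pandas, a fixed seed chosen only for reproducibility, and the illustrative names via_unstack and via_pivot. It calls reset_index() so that pivot_table can refer to the old index by the column name "index", rather than passing test.index directly as in the timing above. Note that pivot_table aggregates with the mean by default, which changes nothing here only because every (index, name) pair occurs exactly once.

```python
import numpy
import pandas

numpy.random.seed(0)  # illustrative seed, not part of the original question
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))

# Approach 1: append "name" to the index, then unstack the score Series.
via_unstack = test.set_index([test.index, "name"])["score"].unstack()

# Approach 2: pivot_table on the old index, exposed as the column "index".
via_pivot = test.reset_index().pivot_table(
    values="score", index="index", columns="name"
)

# Compare the raw values, treating NaN in the same position as equal.
same = numpy.allclose(via_unstack.values, via_pivot.values, equal_nan=True)
print(same)
```

Timings on a ten-row frame mostly reflect fixed overhead, so the ranking is better re-measured on data of realistic size.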

Best answer

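The function you are looking for is unstack. So that pandas knows what to unstack by, we first create a MultiIndex in which the name column is added as the last index level. unstack() then unstacks on the last index level (the default), which gives exactly what you want: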

In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
In[153]: test
Out[153]: 
      score name
0 -0.208392    A
1 -0.103659    A
2  1.645287    A
0  0.119709    B
1 -0.047639    B
2 -0.479155    B
0 -0.415372    C
1 -1.390416    C
2 -0.384158    C
3 -1.328278    C
In[154]: test.set_index([test.index, 'name'], inplace=True)
test.unstack()
Out[154]: 
         score                    
name         A         B         C
0    -0.208392  0.119709 -0.415372
1    -0.103659 -0.047639 -1.390416
2     1.645287 -0.479155 -0.384158
3          NaN       NaN -1.328278
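The output above still carries "score" as an outer column level, whereas the table asked for in the question has plain A, B, C columns. One way to get there, sketched here assuming Python 3 and a recent pandas, is to select the "score" column after unstacking. The seed and the names wide and result are illustrative and not part of the original answer.

```python
import numpy
import pandas

numpy.random.seed(0)  # illustrative seed for reproducible numbers
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))

# Add "name" as the last index level, then move that level into the columns.
# Without inplace=True, the original frame `test` is left unchanged.
wide = test.set_index([test.index, "name"]).unstack()

# The columns are now a two-level index ("score", name); selecting "score"
# drops the outer level and leaves one column per group key.
result = wide["score"]
print(result)
```

Row 3 holds NaN for A and B because only group C has an entry at index 3, which matches the outer-join behaviour of the loop in the question.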

The original question and answer can be found on Stack Overflow: https://stackoverflow.com/questions/24759397/
