我有一个很大的pandas.DataFrame
,看起来像这样:
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)
id score name 0 -0.652909 A 1 0.100885 A 2 0.410907 A 0 0.304012 B 1 -0.198157 B 2 -0.054764 B 0 0.358484 C 1 0.616415 C 2 0.389018 C 3 1.164172 C
So the index is non-unique but is unique if I group by the column name
. I would like to split the data frame into subsections by name and then assemble (by means of an outer join) the score columns into one big new data frame and change the column names of the scores to the respective group key. What I have at the moment is:
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
df = df.join(sub["score"], how="outer")
df.columns.values[-1] = key
这会产生预期的结果:
id A B C 0 -0.652909 0.304012 0.358484 1 0.100885 -0.198157 0.616415 2 0.410907 -0.054764 0.389018 3 NaN NaN 1.164172
but seems not very pandas
-ic. Is there a better way?
Edit: Based on the answers I ran some simple timings.
%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
df = df.join(sub["score"], how="outer")
df.columns.values[-1] = key
100 loops, best of 3: 2.46 ms per loop
%%timeit
test.set_index([test.index, "name"]).unstack()
1000 loops, best of 3: 1.04 ms per loop
%%timeit
test.pivot_table("score", test.index, "name")
100 loops, best of 3: 2.54 ms per loop
因此,unstack
似乎是首选方法。
最佳答案
您要查找的函数是unstack 。为了让 pandas
知道要取消堆叠的目的,我们首先创建一个 MultiIndex
,其中我们将列添加为 last 索引。然后,unstack()
将根据最后一个索引层取消堆叠(默认情况下),因此我们得到的正是您想要的:
In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)
In[153]: test
Out[153]:
score name
0 -0.208392 A
1 -0.103659 A
2 1.645287 A
0 0.119709 B
1 -0.047639 B
2 -0.479155 B
0 -0.415372 C
1 -1.390416 C
2 -0.384158 C
3 -1.328278 C
In[154]: test.set_index([index, 'name'], inplace=True)
test.unstack()
Out[154]:
score
name A B C
0 -0.208392 0.119709 -0.415372
1 -0.103659 -0.047639 -1.390416
2 1.645287 -0.479155 -0.384158
3 NaN NaN -1.328278
关于python - 使用 groupby 拆分数据框并将子集合并到列中,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/24759397/