Given a DataFrame with a single column of variable-length lists:
Col1
0 [SF, NYG, 123]
1 [SF, NYG, test, test]
2 [SF, NYG, foo]
3 [SF, NYG]
4 [SF, NYG, 45]
5 [SF, NYG]
6 [SF, NYG, 32]
How can I convert it into a multi-column DataFrame? I want the empty cells to be NaN, like this:
Col1 Col2 Col3 Col4
0 SF NYG 123 NaN
1 SF NYG test test
2 SF NYG foo NaN
3 SF NYG NaN NaN
4 SF NYG 45 NaN
5 SF NYG NaN NaN
6 SF NYG 32 NaN
I was able to come up with
df_new = pd.DataFrame(df.Col1.tolist()).applymap(lambda x: np.nan if not x else x)
but I can't find an elegant way to rename the columns.
Best answer
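You can use the DataFrame constructor on a list of lists, obtained by converting the column to a numpy array with values and then calling tolist. Then replace None with NaN, rename the columns so the numbering starts at 1, and finally call add_prefix: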
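# The chain is wrapped in parentheses so it can span several lines.
df_new = (pd.DataFrame(df.Col1.values.tolist())
            .fillna(np.nan)                    # None -> NaN
            .rename(columns=lambda x: x + 1)   # 0..3 -> 1..4
            .add_prefix('Col'))                # 1..4 -> Col1..Col4
print(df_new)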
Col1 Col2 Col3 Col4
0 SF NYG 123 NaN
1 SF NYG test test
2 SF NYG foo NaN
3 SF NYG NaN NaN
4 SF NYG 45 NaN
5 SF NYG NaN NaN
6 SF NYG 32 NaN
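For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question (assuming every element is a string, which the printed frame does not actually confirm) and names the columns with a plain list comprehension instead of rename plus add_prefix. The variable names wide and missing_per_column are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question; all elements are assumed to be strings.
df = pd.DataFrame({
    "Col1": [
        ["SF", "NYG", "123"],
        ["SF", "NYG", "test", "test"],
        ["SF", "NYG", "foo"],
        ["SF", "NYG"],
        ["SF", "NYG", "45"],
        ["SF", "NYG"],
        ["SF", "NYG", "32"],
    ]
})

# One output row per list; shorter lists are padded with missing values.
wide = pd.DataFrame(df["Col1"].tolist(), index=df.index)

# Name the columns Col1..ColN from the resulting width.
wide.columns = [f"Col{i}" for i in range(1, wide.shape[1] + 1)]

# Count the padded (missing) cells in each column.
missing_per_column = wide.isna().sum()

print(wide)
print(missing_per_column)
```

Passing index=df.index keeps the original row labels, which matters if the source frame does not have a default RangeIndex.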
Timings:
#700
df = pd.concat([df]*100).reset_index(drop=True)
#jez
In [10]: %timeit (pd.DataFrame(df.Col1.values.tolist()))
1000 loops, best of 3: 694 µs per loop
#cᴏʟᴅsᴘᴇᴇᴅ
In [11]: %timeit (pd.DataFrame(df.Col1.tolist()))
1000 loops, best of 3: 705 µs per loop
#Wen
In [12]: %timeit (df.Col1.apply(lambda x: ','.join(str(y) for y in x)).str.split(',', expand=True))
100 loops, best of 3: 3.51 ms per loop
#slower
In [13]: %timeit (df.Col1.apply(pd.Series))
10 loops, best of 3: 159 ms per loop
#7k
df = pd.concat([df]*1000).reset_index(drop=True)
#jez
In [30]: %timeit (pd.DataFrame(df.Col1.values.tolist()))
1000 loops, best of 3: 1.26 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ
In [31]: %timeit (pd.DataFrame(df.Col1.tolist()))
1000 loops, best of 3: 1.37 ms per loop
#Wen
In [32]: %timeit (df.Col1.apply(lambda x: ','.join(str(y) for y in x)).str.split(',', expand=True))
10 loops, best of 3: 29 ms per loop
#very slow, best used only on small DataFrames
In [33]: %timeit (df.Col1.apply(pd.Series))
1 loop, best of 3: 1.58 s per loop
#700k
df = pd.concat([df]*100000).reset_index(drop=True)
#jez
In [40]: %timeit (pd.DataFrame(df.Col1.values.tolist()))
10 loops, best of 3: 80.3 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ
In [41]: %timeit (pd.DataFrame(df.Col1.tolist()))
10 loops, best of 3: 90.5 ms per loop
#Wen
In [42]: %timeit (df.Col1.apply(lambda x: ','.join(str(y) for y in x)).str.split(',', expand=True))
1 loop, best of 3: 2.91 s per loop
#extremely slow
In [3]: %timeit (df.Col1.apply(pd.Series))
1 loop, best of 3: 3min 58s per loop
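The %timeit lines above only work inside IPython. Below is a rough sketch of the same comparison as a plain script using the standard timeit module, assuming the 700-row case and string elements. The candidates dictionary and its labels are made up for this example, and the absolute numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

# The 7 sample rows from the question, assumed to hold strings only.
rows = [
    ["SF", "NYG", "123"],
    ["SF", "NYG", "test", "test"],
    ["SF", "NYG", "foo"],
    ["SF", "NYG"],
    ["SF", "NYG", "45"],
    ["SF", "NYG"],
    ["SF", "NYG", "32"],
]
df = pd.DataFrame({"Col1": rows * 100})  # 700 rows

# Hypothetical labels mapped to the expressions timed in the answer.
candidates = {
    "values.tolist": lambda: pd.DataFrame(df.Col1.values.tolist()),
    "tolist": lambda: pd.DataFrame(df.Col1.tolist()),
    "apply(pd.Series)": lambda: df.Col1.apply(pd.Series),
}

# Best of 3 single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```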
About python - exploding a single-column DataFrame of ragged lists into multiple columns: a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45645847/