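I have a DataFrame in which I have confirmed that each row contains at most one value (the rest are np.nan). How can I convert it to a one-dimensional array or Series?
Suppose this is my starting DataFrame: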
In [7]: import numpy as np
   ...: import pandas as pd
In [8]: data = [
   ...:     [np.nan, 9.0, np.nan],
   ...:     [np.nan, np.nan, 3.0],
   ...:     [np.nan, np.nan, 5.0],
   ...:     [np.nan, np.nan, np.nan],
   ...:     [1.0, np.nan, np.nan]
   ...: ]
In [9]: a = pd.DataFrame(data)
In [10]: a
Out[10]:
     0    1    2
0  NaN  9.0  NaN
1  NaN  NaN  3.0
2  NaN  NaN  5.0
3  NaN  NaN  NaN
4  1.0  NaN  NaN
I want to create the following Series b:
In [17]: b
Out[17]:
0 9.0
1 3.0
2 5.0
3 NaN
4 1.0
dtype: float64
I have already written some code that does this:
In [14]: m = a.notnull()
In [15]: m
Out[15]:
       0      1      2
0  False   True  False
1  False  False   True
2  False  False   True
3  False  False  False
4   True  False  False
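In [16]: b = pd.Series(np.nan, index=a.index)  # start with all NaN
    ...: for i, row in a.iterrows():
    ...:     for j, v in row.items():  # items() replaces the removed iteritems()
    ...:         if m.iloc[i, j]:
    ...:             b[i] = v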
But there must be a simpler way!
I tried using np.max and np.sum, but both return an empty (nan) array.
Best answer
You can use first_valid_index, but you need a condition for the case where all values in a row are NaN:
def f(x):
    # Label of the first non-NaN value in the row, or None if the row is all NaN
    idx = x.first_valid_index()
    if idx is None:
        return None
    return x[idx]

b = a.apply(f, axis=1)
print(b)
0 9.0
1 3.0
2 5.0
3 NaN
4 1.0
dtype: float64
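A vectorized variant of the same "first valid value per row" idea can be sketched with bfill. This is not part of the original answer; the name b_bfill is made up for illustration, and the sketch relies on the question's premise that each row holds at most one value.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# Back-fill along each row so that the first column ends up holding the
# first non-NaN value of that row (or NaN if the whole row is NaN).
b_bfill = a.bfill(axis=1).iloc[:, 0]
print(b_bfill)
```

Because it avoids a Python-level function call per row, this should scale better than apply, though that is not measured here.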
Another solution uses sum and numpy.where:
print(pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))
0 9.0
1 3.0
2 5.0
3 NaN
4 1.0
dtype: float64
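The where step is needed because sum returns 0.0 for an all-NaN row by default. As a hedged alternative that the original answer does not mention, the min_count parameter of sum can express the same condition directly; b_sum is an illustrative name.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# By default an all-NaN row sums to 0.0, which would hide the missing value.
default_sum = a.sum(axis=1)

# min_count=1 requires at least one non-NaN value per row; otherwise the
# result for that row is NaN.
b_sum = a.sum(axis=1, min_count=1)
print(b_sum)
```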
The np.max solution also works well:
print (np.max(a, axis=1))
0 9.0
1 3.0
2 5.0
3 NaN
4 1.0
dtype: float64
Or the simpler and fastest option, max:
print (a.max(axis=1))
0 9.0
1 3.0
2 5.0
3 NaN
4 1.0
dtype: float64
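One possible explanation for the NaN results mentioned in the question (an assumption, since the question does not show the exact calls) is that the reduction was applied to a plain NumPy array, where NaN propagates. The sketch below contrasts that with the NaN-skipping behavior of DataFrame.max; the variable names are illustrative.

```python
import warnings

import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])
raw = a.to_numpy()

# Plain NumPy reductions propagate NaN: every row here contains a NaN,
# so every result is NaN.
plain = np.max(raw, axis=1)

# np.nanmax ignores NaN; it emits a RuntimeWarning for an all-NaN row
# and returns NaN there, so the warning is silenced for this demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    nan_aware = np.nanmax(raw, axis=1)

# DataFrame.max skips NaN by default (skipna=True).
via_pandas = a.max(axis=1)

print(plain)
print(nan_aware)
print(via_pandas.to_numpy())
```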
Timings:
a = pd.concat([a]*10000).reset_index(drop=True)
In [133]: %timeit (a.max(axis=1))
100 loops, best of 3: 2.81 ms per loop
In [134]: %timeit (np.max(a, axis=1))
100 loops, best of 3: 2.83 ms per loop
In [135]: %timeit (pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))
100 loops, best of 3: 3.18 ms per loop
In [136]: %timeit (a.apply(f, axis=1))
1 loop, best of 3: 2.18 s per loop
#http://stackoverflow.com/a/39011722/2901002
In [137]: %timeit a.max(axis=1, skipna=True)
100 loops, best of 3: 2.84 ms per loop
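def user(dataDF):
    # Start from an all-NaN float Series, then overwrite it with each column's non-NaN values
    squash = pd.Series(np.nan, index=dataDF.index, dtype=float)
    for col in dataDF.columns.values:
        squash.update(dataDF[col])
    return squash
print(user(a))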
In [151]: %timeit (user(a))
100 loops, best of 3: 7.75 ms per loop
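The %timeit lines above require IPython. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The frame size, repeat counts, and the names big, candidates, and timings are choices made for this sketch, and the numbers it prints will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])
# A smaller frame than in the timings above, to keep the sketch quick.
big = pd.concat([base] * 1000).reset_index(drop=True)

candidates = {
    "DataFrame.max": lambda: big.max(axis=1),
    "sum + np.where": lambda: pd.Series(
        np.where(big.notna().any(axis=1), big.sum(axis=1), np.nan)
    ),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    timings[name] = best
    print(f"{name}: {best * 1e3:.3f} ms per call")
```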
Edit prompted by a comment:
If the values are not numeric, you can use:
import pandas as pd
import numpy as np
data = [
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan]
]
a = pd.DataFrame(data)
print (a)
     0    1    2
0  NaN    a  NaN
1  NaN  NaN    b
2  NaN  NaN    c
3  NaN  NaN  NaN
4    d  NaN  NaN
print(a.fillna('').sum(axis=1).mask(a.isnull().all(axis=1)))
0 a
1 b
2 c
3 NaN
4 d
dtype: object
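For non-numeric data, another option that does not depend on string concatenation is to look up the position of the first non-NaN entry in each row. This is a sketch rather than part of the original answer; mask, first_col, and b_obj are illustrative names, and it again assumes at most one value per row.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan],
])

# Boolean array marking the non-NaN cells.
mask = a.notna().to_numpy()

# argmax returns the position of the first True in each row. For an
# all-NaN row it returns 0, and the cell at column 0 is NaN anyway.
first_col = mask.argmax(axis=1)

# Pick one cell per row using the computed column positions.
values = a.to_numpy()[np.arange(len(a)), first_col]
b_obj = pd.Series(values, index=a.index)
print(b_obj)
```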
Regarding "python - How do I 'squash' values in a DataFrame, where I know there is only one item per row, into a Series?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39011192/