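python - How can I 'squash' the values in a DataFrame into a Series when I know each row has only one item?

I have a DataFrame in which I have confirmed that each row contains at most one value (the rest are np.nan). How can I convert it to a one-dimensional array or Series?

Suppose this is my starting array: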

In [6]: import numpy as np

In [7]: import pandas as pd

In [8]: data = [
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan]
]

In [9]: a = pd.DataFrame(data)

In [10]: a
Out[10]: 
     0    1    2
0  NaN  9.0  NaN
1  NaN  NaN  3.0
2  NaN  NaN  5.0
3  NaN  NaN  NaN
4  1.0  NaN  NaN

I want to create the following Series b:

In [17]: b
Out[17]: 
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64

I have written some code that does this:

In [14]: m = a.notnull()

In [15]: m
Out[15]: 
       0      1      2
0  False   True  False
1  False  False   True
2  False  False   True
3  False  False  False
4   True  False  False

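In [16]: b = pd.Series(np.nan, index=a.index)  # start from an all-NaN result
    ...: for i, row in a.iterrows():
    ...:     for j, v in row.items():  # iteritems() was removed in pandas 2.0
    ...:         if m.iloc[i, j]:
    ...:             b[i] = v

But there must be a simpler way!

I tried using np.max and np.sum, but they both return an empty (nan) array.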

Best answer

You can use first_valid_index, but you need a condition for the case where all values in a row are NaN:

def f(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]

b = a.apply(f, axis=1)

print (b)
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
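
A vectorized way to get the same "first valid value per row" result, without calling a Python function for every row, is to back-fill along the columns and take the first column. This is only a sketch and is not part of the original answer; the name b_bfill is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# Back-fill within each row so that the first valid value of the row
# ends up in the first column; rows that are entirely NaN stay NaN.
b_bfill = a.bfill(axis=1).iloc[:, 0]
print(b_bfill)
```

The resulting Series keeps the label of the first column as its name; call b_bfill.rename(None) if that is unwanted.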

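Another solution uses sum and numpy.where. The where is needed because sum skips NaN and therefore returns 0 for a row that is entirely NaN:

print (pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))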
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
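
The same effect can probably be had from sum alone through its min_count parameter, which makes the result NaN when a row has fewer than that many valid values. The sketch below is not from the original answer; b_sum and plain_sum are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])

# A plain sum skips NaN, so the all-NaN row collapses to 0.0.
plain_sum = a.sum(axis=1)

# min_count=1 requires at least one valid value per row; otherwise NaN.
b_sum = a.sum(axis=1, min_count=1)
print(b_sum)
```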

The np.max solution also works well:

print (np.max(a, axis=1))
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64

Or, simpler and fastest, max:

print (a.max(axis=1))
0    9.0
1    3.0
2    5.0
3    NaN
4    1.0
dtype: float64
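
Note that max (like sum) only gives the intended answer because each row holds at most one non-NaN value. If that is not already guaranteed, it may be worth checking before squashing. The following is a minimal sketch; the helper name squash_single_value is hypothetical and not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd


def squash_single_value(df):
    """Collapse a numeric frame with at most one valid value per row into a Series."""
    # Count the non-NaN entries in every row.
    valid_per_row = df.notna().sum(axis=1)
    if (valid_per_row > 1).any():
        raise ValueError("some rows contain more than one non-NaN value")
    # With at most one valid value per row, the row maximum is that value
    # (or NaN when the row is entirely NaN).
    return df.max(axis=1)


a = pd.DataFrame([
    [np.nan, 9.0, np.nan],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan],
    [1.0, np.nan, np.nan],
])
print(squash_single_value(a))
```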

Timings:

a = pd.concat([a]*10000).reset_index(drop=True)

In [133]: %timeit (a.max(axis=1))
100 loops, best of 3: 2.81 ms per loop

In [134]: %timeit (np.max(a, axis=1))
100 loops, best of 3: 2.83 ms per loop

In [135]: %timeit (pd.Series(np.where(a.notnull().any(axis=1), a.sum(axis=1), np.nan)))
100 loops, best of 3: 3.18 ms per loop

In [136]: %timeit (a.apply(f, axis=1))
1 loop, best of 3: 2.18 s per loop

#http://stackoverflow.com/a/39011722/2901002
In [137]: %timeit a.max(axis=1, skipna=True)
100 loops, best of 3: 2.84 ms per loop

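def user(dataDF):
    # Start from an all-NaN float Series; an explicit dtype avoids the
    # object-dtype default for empty data in recent pandas versions.
    squash = pd.Series(index=dataDF.index, dtype=float)
    for col in dataDF.columns.values:
        # update() only overwrites with non-NaN values from each column
        squash.update(dataDF[col])
    return squash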

print(user(a))
In [151]: %timeit (user(a))
100 loops, best of 3: 7.75 ms per loop

Edit, in response to a comment:

If the values are not numeric, you can use:

import pandas as pd
import numpy as np

data = [
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan]
]

a = pd.DataFrame(data)
print (a)
     0    1    2
0  NaN    a  NaN
1  NaN  NaN    b
2  NaN  NaN    c
3  NaN  NaN  NaN
4    d  NaN  NaN

print (a.fillna('').sum(axis=1).mask(a.isnull().all(axis=1)))
0      a
1      b
2      c
3    NaN
4      d
dtype: object
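
A dtype-agnostic alternative, which does not rely on string concatenation, is to locate the first valid column of each row with NumPy and pick that element directly. This is a sketch under the same "at most one value per row" assumption; the function name first_valid_per_row is hypothetical and not from the original answer.

```python
import numpy as np
import pandas as pd


def first_valid_per_row(df):
    """Return the first non-NaN value of each row, or NaN if the row has none."""
    mask = df.notna().to_numpy()
    # argmax gives the position of the first True in each row
    # (and 0 for rows that contain no True at all).
    first_pos = mask.argmax(axis=1)
    values = df.to_numpy()[np.arange(len(df)), first_pos]
    # Blank out rows without any valid value.
    return pd.Series(values, index=df.index).where(mask.any(axis=1))


a = pd.DataFrame([
    [np.nan, 'a', np.nan],
    [np.nan, np.nan, 'b'],
    [np.nan, np.nan, 'c'],
    [np.nan, np.nan, np.nan],
    ['d', np.nan, np.nan],
])
print(first_valid_per_row(a))
```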

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/39011192/
