python-3.x - Create a list of tuples for consecutive NaN values across all columns

Tags: python-3.x pandas

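I am trying to build a list of tuples holding the start and end datetimes of each consecutive run of rows in which all columns have NaN values.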

For the example below, my result should look like:

missing_dates = [('2018-10-10 20:00:00', '2018-10-10 22:00:00'),
                 ('2018-10-11 02:00:00', '2018-10-11 03:00:00')]

If there is an isolated NaN row, its datetime should be repeated in the tuple.

Example list of dictionaries, with a table for visualization.

   dicts = [
        {'datetime': '2018-10-10 18:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-10 19:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-10 19:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-10 19:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-10 20:00:00', 'variable1': np.nan, 'variable2': np.nan},
        {'datetime': '2018-10-10 21:00:00', 'variable1': np.nan, 'variable2': np.nan},
        {'datetime': '2018-10-10 22:00:00', 'variable1': np.nan, 'variable2': np.nan},
        {'datetime': '2018-10-10 23:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-10 23:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-11 00:00:00', 'variable1': 20, 'variable2': 30},
        {'datetime': '2018-10-11 01:00:00', 'variable1': np.nan, 'variable2': 30},
        {'datetime': '2018-10-11 02:00:00', 'variable1': np.nan, 'variable2': np.nan},
        {'datetime': '2018-10-11 03:00:00', 'variable1': np.nan, 'variable2': np.nan}]

Table representation:

+---------------------+-----------+-----------+
|          datetime   | variable1 | variable2 |
+---------------------+-----------+-----------+
| 2018-10-10 18:00:00 |      20.0 |     30.0  |
| 2018-10-10 19:00:00 |      20.0 |     30.0  | 
| 2018-10-10 19:00:00 |      20.0 |     30.0  |
| 2018-10-10 19:00:00 |      20.0 |     30.0  |
| 2018-10-10 20:00:00 |       NaN |     NaN   |
| 2018-10-10 21:00:00 |       NaN |     NaN   |
| 2018-10-10 22:00:00 |       NaN |     NaN   |
| 2018-10-10 23:00:00 |      20.0 |     30.0  |
| 2018-10-10 23:00:00 |      20.0 |     30.0  | 
| 2018-10-11 00:00:00 |      20.0 |     30.0  |
| 2018-10-11 01:00:00 |       NaN |     30.0  |
| 2018-10-11 02:00:00 |       NaN |     NaN   |
| 2018-10-11 03:00:00 |       NaN |     NaN   |
+---------------------+-----------+-----------+

What I did:

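import pandas as pd

df = pd.DataFrame(dicts)
# True for rows where every value column is NaN
s = df.set_index('datetime').isnull().all(axis=1)
df['new_col'] = s.values
df['datetime'] = pd.to_datetime(df['datetime'])
# keep only the all-NaN rows; copy so the new column is not set on a slice
new_df = df.loc[df['new_col']].copy()
# time elapsed since the previous all-NaN row
new_df['delta'] = new_df['datetime'] - new_df['datetime'].shift(1)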

I get a nice dataframe with the deltas, but I am a bit lost from here.

Best answer

Use:

# boolean mask: True for rows that have at least one non-NaN value
mask = df.drop('datetime', axis=1).notnull().any(axis=1)
# the counter only grows on rows with data, so consecutive missing rows share one group id
df['g'] = mask.cumsum()

# take the first and last datetime of each group of missing rows and convert each pair to a tuple
L = list(map(tuple, df[~mask].groupby('g')['datetime'].agg(['first','last']).values.tolist()))
print(L)
[('2018-10-10 20:00:00', '2018-10-10 22:00:00'), 
 ('2018-10-11 02:00:00', '2018-10-11 03:00:00')]
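
The grouping trick above can be wrapped into a self-contained sketch that runs on the question's sample data. The function name `missing_ranges` and its `time_col` parameter are hypothetical names chosen for illustration, not part of the original answer, and the sketch assumes the rows are already in time order.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "datetime": [
        "2018-10-10 18:00:00", "2018-10-10 19:00:00", "2018-10-10 19:00:00",
        "2018-10-10 19:00:00", "2018-10-10 20:00:00", "2018-10-10 21:00:00",
        "2018-10-10 22:00:00", "2018-10-10 23:00:00", "2018-10-10 23:00:00",
        "2018-10-11 00:00:00", "2018-10-11 01:00:00", "2018-10-11 02:00:00",
        "2018-10-11 03:00:00",
    ],
    "variable1": [20, 20, 20, 20, nan, nan, nan, 20, 20, 20, nan, nan, nan],
    "variable2": [30, 30, 30, 30, nan, nan, nan, 30, 30, 30, 30, nan, nan],
})


def missing_ranges(df, time_col="datetime"):
    """Hypothetical helper: (first, last) time of each run of all-NaN rows."""
    # True for rows that have at least one non-NaN value column
    has_data = df.drop(columns=time_col).notna().any(axis=1)
    # The counter grows only on rows with data, so a run of
    # consecutive all-NaN rows keeps the same id.
    run_id = has_data.cumsum()
    gaps = ~has_data
    bounds = df.loc[gaps, time_col].groupby(run_id[gaps]).agg(["first", "last"])
    # One plain tuple per run
    return list(bounds.itertuples(index=False, name=None))


ranges = missing_ranges(df)
print(ranges)
```

Note that the row at 2018-10-11 01:00:00 is not part of the second range, because `variable2` still holds a value there.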

A similar solution, only with the mask inverted:

# boolean mask: True for rows where every value column is NaN
mask = df.drop('datetime', axis=1).isnull().all(axis=1)
df['g'] = (~mask).cumsum()

L = list(map(tuple, df[mask].groupby('g')['datetime'].agg(['first','last']).values.tolist()))
print(L)
[('2018-10-10 20:00:00', '2018-10-10 22:00:00'), 
 ('2018-10-11 02:00:00', '2018-10-11 03:00:00')]
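
The question also asks that an isolated NaN row repeat its datetime in the tuple. A minimal check of that case, using a small made-up three-row frame (not the question's data) and the inverted-mask variant:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: a single all-NaN row between two valid rows
df = pd.DataFrame({
    "datetime": ["2018-10-10 18:00:00", "2018-10-10 19:00:00", "2018-10-10 20:00:00"],
    "variable1": [20, np.nan, 20],
    "variable2": [30, np.nan, 30],
})

# True for rows where every value column is NaN
mask = df.drop("datetime", axis=1).isnull().all(axis=1)
# Group ids: the counter grows only on rows that contain data
groups = (~mask).cumsum()

# For a one-row group, 'first' and 'last' are the same value
pairs = df[mask].groupby(groups[mask])["datetime"].agg(["first", "last"])
L = list(map(tuple, pairs.values.tolist()))
print(L)
# [('2018-10-10 19:00:00', '2018-10-10 19:00:00')]
```

Because `first` and `last` are taken over a single-row group, the isolated datetime appears twice in the tuple with no special handling.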

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/52778097/
