python - Finding specific values that meet a condition

Tags: python pandas dataframe indexing

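I am trying to create a new column from values that satisfy certain conditions. The code below goes some way towards expressing the logic, but it does not produce the correct output: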

import pandas as pd
import numpy as np


df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'], 
                'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0], 
                'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
                'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
                'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
                'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })

max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
                  (df['type'] == 1) & (df['colour'] == 'red')]


max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
                np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]


df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        12.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      6000.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN
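Explanation of why the output is incorrect:
12.0 in index row 1 of column df['pixelLimit'] is incorrect because the value comes from df['minPixel'] in index row 6, whose df['date'] of 2019-08-08 17:00:00 is later than the df['colourDate'] of 2019-08-08 16:00:00 in index row 1.
6000.0 in index row 3 of column df['pixelLimit'] is incorrect because the value comes from df['maxPixel'] in index row 7, whose df['date'] of 2019-08-09 16:00:00 is later than the df['colourDate'] of 2019-08-06 22:00:00 in index row 3.
Correct output: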
                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN
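Explanation of why the output is correct:
14.0 in index row 1 of column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] among rows whose df['date'] is earlier than the df['colourDate'] in index row 1 and later than or equal to the df['date'] in index row 1.
5184.0 in index row 3 of column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] among rows whose df['date'] is earlier than the df['colourDate'] in index row 3 and later than or equal to the df['date'] in index row 3.
Notes:
Perhaps np.select is not the best fit for this task, and some kind of function would serve it better?
Also, perhaps I need to create some kind of dynamic len to use as the starting point for each row?
Request
Could anyone please help me amend my code to achieve the correct output?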

Best answer

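For a matching problem like this, one possibility is to do a full merge, then use a Boolean Series to subset to all rows that satisfy the condition for that row, and then find the max/min among all the possible matches. Since this requires slightly different columns and different functions, I split the operations into two very similar pieces of code, one handling 1/blue and the other handling 1/red.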
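To make the pattern concrete before applying it to the question's data, here is a minimal sketch on a toy frame. The frames `left` and `right` and all their values are made up for illustration. It also shows where suffixed names such as `t_x` and `t_y` come from: when both sides of the merge share a column name, pandas appends `_x` and `_y` by default.

```python
import pandas as pd

# Toy data, hypothetical and unrelated to the question's values.
left = pd.DataFrame({'key': ['a', 'b'],
                     't': [1, 3],       # window start
                     'end': [3, 4]})    # window end
right = pd.DataFrame({'t': [1, 2, 3, 4],
                      'value': [10, 5, 7, 2]})

# Cross merge: every row of `left` is paired with every row of `right`.
# Both frames have a column `t`, so the result has `t_x` and `t_y`.
pairs = left.merge(right, how='cross')  # how='cross' needs pandas >= 1.2

# Keep only the pairs whose `t_y` lies inside [t_x, end] (inclusive).
in_window = pairs[pairs['t_y'].between(pairs['t_x'], pairs['end'])]

# Reduce the surviving candidates to one value per original row.
result = in_window.groupby('key')['value'].min()
print(result)
```

The cross merge produces len(left) * len(right) rows, so this approach trades memory for simplicity; it is convenient when only a small subset of rows needs matching.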
First some housekeeping: convert the date columns to datetime.

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])

Calculate the min pixel for 1/red between the times for each row:
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# For pandas < 1.2 (which lacks how='cross'), instead use:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')

# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
            .groupby('index')['minPixel_y'].min()
            .rename('pixel_limit'))
#index
#1    14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()

dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')

smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
           .groupby('index')['maxPixel_y'].max()
           .rename('pixel_limit'))

Finally, because the above groups on the original index (i.e. 'index'), we can simply assign back, and the result aligns with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])

                 date  type colour  maxPixel  minPixel          colourDate  pixel_limit
0 2019-08-06 09:00:00   0.0   blue       255        86 2019-08-06 12:00:00          NaN
1 2019-08-06 12:00:00   1.0    red      7346        96 2019-08-08 16:00:00         14.0
2 2019-08-06 18:00:00   NaN    NaN        32        14 2019-08-06 23:00:00          NaN
3 2019-08-06 21:00:00   1.0   blue      5184      3540 2019-08-06 22:00:00       5184.0
4 2019-08-07 09:00:00   NaN    NaN       600       528 2019-08-08 09:00:00          NaN
5 2019-08-07 16:00:00   NaN    NaN       322       300 2019-08-09 16:00:00          NaN
6 2019-08-08 17:00:00   0.0   blue        72        12 2019-08-08 23:00:00          NaN
7 2019-08-09 16:00:00   0.0    red      6000      4009 2019-08-11 16:00:00          NaN

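If you need to bring along many different pieces of information from the row with the min/max pixel, then instead of groupby min/max we sort_values and then use groupby + head/tail to get the row with the min or max pixel. For min this looks like the following (with a slight renaming of the suffixes):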
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
#              .merge(df[['date', 'minPixel']].reset_index().assign(t=1), 
#                     on='t', suffixes=['', '_match'])) 

# Only keep rows between the dates, then among those find the min minPixel row. 
# A bunch of renaming. 
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
            .sort_values('minPixel_match', ascending=True)
            .groupby('index').head(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'minPixel_match': 'pixel_limit'}))
Max is similar, using .tail:
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])

smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
            .sort_values('maxPixel_match', ascending=True)
            .groupby('index').tail(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'maxPixel_match': 'pixel_limit'}))
Finally, we concat along axis=1, since we now need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
                  date  type colour  maxPixel  minPixel           colourDate  index_match           date_match  pixel_limit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00          NaN                  NaN          NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00          2.0  2019-08-06 18:00:00         14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00          NaN                  NaN          NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00          3.0  2019-08-06 21:00:00       5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00          NaN                  NaN          NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00          NaN                  NaN          NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00          NaN                  NaN          NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00          NaN                  NaN          NaN
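Because the min and max branches differ only in the colour, the value column and whether head or tail is taken, they can be folded into one function. The sketch below is one way to do that; `match_extreme` and its parameters are hypothetical names introduced here, not part of the answer, and it requires pandas >= 1.2 for how='cross'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime([
        '2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00',
        '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00',
        '2019-08-08 17:00:00', '2019-08-09 16:00:00']),
    'type': [0, 1, np.nan, 1, np.nan, np.nan, 0, 0],
    'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
    'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
    'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
    'colourDate': pd.to_datetime([
        '2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00',
        '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00',
        '2019-08-08 23:00:00', '2019-08-11 16:00:00']),
})


def match_extreme(df, colour, value_col, largest):
    """Hypothetical helper: cross-merge pattern for type 1 rows of one colour."""
    # Rows that need a limit; reset_index keeps the original label in 'index'.
    sub = df[df['type'].eq(1) & df['colour'].eq(colour)].reset_index()
    # Pair each of them with every row of the original frame.
    pairs = sub.merge(df[['date', value_col]].reset_index(),
                      how='cross', suffixes=('', '_match'))
    # Keep candidates dated inside [date, colourDate], sorted by pixel value.
    pairs = (pairs[pairs['date_match'].between(pairs['date'], pairs['colourDate'])]
             .sort_values(value_col + '_match'))
    grouped = pairs.groupby('index')
    # After an ascending sort, tail(1) is the largest and head(1) the smallest.
    best = grouped.tail(1) if largest else grouped.head(1)
    return (best.set_index('index')
                .filter(like='_match')
                .rename(columns={value_col + '_match': 'pixel_limit'}))


matches = pd.concat([match_extreme(df, 'red', 'minPixel', largest=False),
                     match_extreme(df, 'blue', 'maxPixel', largest=True)])
result = pd.concat([df, matches], axis=1)
print(result[['index_match', 'date_match', 'pixel_limit']])
```

Keeping the colour and column as arguments makes it straightforward to add further type/colour rules without copying the block again.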

Regarding python - Finding specific values that meet a condition, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66165400/
