python pandas groupby and filter function to drop candidate names without two forecasts

Tags: python pandas filtering pandas-groupby

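The task is to use Pandas to drop any invalid names and any candidate names that do not have two forecasts. In the dataframe, some candidate names appear on both forecast dates, while others appear on only one. So I want to drop the candidates that have only one forecast date.

I am trying to use groupby and the filter function to drop candidate names that do not satisfy both conditions: ('forecast_date' == '2018-08-11') AND ('forecast_date' == '2018-11-06')

Here is my code: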

election_sub=election_sub.dropna(subset=['candidate'])
election_sub.groupby('candidate')
grouped.filter(lambda x: (x['forecast_date']== '2018-08-11')&(x['forecast_date']=='2018-11-06'))

Here is the dataframe: dataframe

Best answer

Use:

import pandas as pd

#load the data into a DataFrame, parsing both date columns as datetimes
url = 'https://raw.githubusercontent.com/fivethirtyeight/checking-our-work-data/master/us_house_elections.csv'
election_sub = pd.read_csv(url, parse_dates=['election_date','forecast_date'])

#filter out `NaN`s
election_sub=election_sub.dropna(subset=['candidate'])

#keep rows whose forecast_date matches either of the two dates (OR)
df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
#count the unique forecast dates per candidate
s = df.groupby('candidate')['forecast_date'].nunique()
#keep only candidates that have both dates (the AND condition)
cand = s.index[s.eq(2)].unique()
print (cand)

Index(['A. Donald McEachin', 'Aaron Andrus', 'Aaron Swisher',
       'Abby Finkenauer', 'Abigail Spanberger', 'Adam B. Schiff',
       'Adam Kinzinger', 'Adam Smith', 'Adrian Smith', 'Adriano Espaillat',
       ...
       'William Lacy Clay', 'William Tanoos', 'William Timmons',
       'Willie Billups', 'Xochitl Torres Small', 'Young Kim', 'Yvette Clarke',
       'Yvette Herrell', 'Yvonne Hayes Hinson', 'Zoe Lofgren'],
      dtype='object', name='candidate', length=960)

#filter original data by candidates
df = election_sub[election_sub['candidate'].isin(cand)]
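To see the steps above on something small enough to check by hand, here is a minimal sketch on a made-up frame. The candidate names `Ann`, `Bob`, and `Cid` and the variable names `toy`, `targets`, `on_target`, `n_dates`, and `keep` are hypothetical and are not part of the original data or answer.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'candidate': ['Ann', 'Ann', 'Bob', 'Cid', 'Cid', 'Cid'],
    'forecast_date': pd.to_datetime([
        '2018-08-11', '2018-11-06',                # Ann: both target dates
        '2018-08-11',                              # Bob: only the first date
        '2018-11-06', '2018-11-06', '2018-09-01',  # Cid: only the second target date
    ]),
})

targets = pd.to_datetime(['2018-08-11', '2018-11-06'])

# Step 1: keep rows that fall on either target date (OR).
on_target = toy[toy['forecast_date'].isin(targets)]

# Step 2: count distinct target dates per candidate.
n_dates = on_target.groupby('candidate')['forecast_date'].nunique()

# Step 3: candidates with every target date present (AND).
keep = n_dates.index[n_dates.eq(len(targets))]

# Step 4: select all rows of those candidates from the original frame.
result = toy[toy['candidate'].isin(keep)]
print(result)
```

Only Ann's rows remain. Cid has three rows, but they cover just one of the two target dates, which is why the sketch counts unique dates rather than rows.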

Your own approach also works if you test, with any, whether each condition is True at least once per group. Each any call returns a scalar, so for AND use and:

grouped = election_sub.groupby('candidate')
df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())

print(df)
        year office state  district special election_date forecast_date  \
0       2018  House    WY       1.0   False    2018-11-06    2018-11-06   
1       2018  House    WY       1.0   False    2018-11-06    2018-11-06   
2       2018  House    WY       1.0   False    2018-11-06    2018-11-06   
3       2018  House    WY       1.0   False    2018-11-06    2018-11-06   
4       2018  House    WY       1.0   False    2018-11-06    2018-11-06   
...      ...    ...   ...       ...     ...           ...           ...   
282688  2018  House    AK       1.0   False    2018-11-06    2018-08-01   
282689  2018  House    AK       1.0   False    2018-11-06    2018-08-01   
282690  2018  House    AK       1.0   False    2018-11-06    2018-08-01   
282691  2018  House    AK       1.0   False    2018-11-06    2018-08-01   
282692  2018  House    AK       1.0   False    2018-11-06    2018-08-01   

       forecast_type party        candidate  projected_voteshare  \
0               lite     D      Greg Hunter             33.29836   
1               lite     R       Liz Cheney             61.18835   
2             deluxe     D      Greg Hunter             31.37998   
3             deluxe     R       Liz Cheney             63.10673   
4            classic     D      Greg Hunter             31.33293   
...              ...   ...              ...                  ...   
282688          lite     R        Don Young             50.74973   
282689        deluxe     D  Alyse S. Galvin             41.49152   
282690        deluxe     R        Don Young             51.96705   
282691       classic     D  Alyse S. Galvin             44.10701   
282692       classic     R        Don Young             49.35155   

        actual_voteshare  probwin  probwin_outcome  
0                    NaN  0.00134                0  
1                    NaN  0.99866                1  
2                    NaN  0.00020                0  
3                    NaN  0.99980                1  
4                    NaN  0.00032                0  
...                  ...      ...              ...  
282688               NaN  0.76900                1  
282689               NaN  0.12776                0  
282690               NaN  0.87224                1  
282691               NaN  0.28146                0  
282692               NaN  0.71854                1  

[282240 rows x 14 columns]
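The difference between the element-wise comparison in the question and the per-group test can be checked on a small frame. This is a minimal sketch on made-up data; the names `toy`, `d1`, `d2`, `rowwise`, and `has_both` are hypothetical and do not come from the original post.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'candidate': ['Ann', 'Ann', 'Bob', 'Cid', 'Cid'],
    'forecast_date': pd.to_datetime([
        '2018-08-11', '2018-11-06',   # Ann: both dates
        '2018-08-11',                 # Bob: first date only
        '2018-11-06', '2018-09-01',   # Cid: second date only
    ]),
})

d1 = pd.Timestamp('2018-08-11')
d2 = pd.Timestamp('2018-11-06')

# Row-wise AND: a single row holds a single date, so this is never True.
rowwise = (toy['forecast_date'] == d1) & (toy['forecast_date'] == d2)

def has_both(group):
    # Each .any() reduces a boolean Series to one True/False for the group,
    # so the function returns a single boolean, which is what filter expects.
    seen_d1 = (group['forecast_date'] == d1).any()
    seen_d2 = (group['forecast_date'] == d2).any()
    return bool(seen_d1 and seen_d2)

filtered = toy.groupby('candidate').filter(has_both)
print(filtered)
```

In this sketch `rowwise` is False on every row, while `filtered` keeps Ann's two rows, since Ann is the only candidate seen on both dates.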

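EDIT:

The two solutions differ in performance. In the timings below, the isin/nunique version takes about 61 ms, while the groupby.filter version takes about 1.07 s: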

In [41]: %%timeit
    ...: df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
    ...: #get number of unique datetimes per groups
    ...: s = df.groupby('candidate')['forecast_date'].nunique()
    ...: #filter candidates only with both datetimes, like condition AND
    ...: cand = s.index[s.eq(2)].unique()
    ...: 
    ...: #filter original data by candidates
    ...: df = election_sub[election_sub['candidate'].isin(cand)]
    ...: 
61.3 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %%timeit
    ...: grouped = election_sub.groupby('candidate')
    ...: df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
    ...: 
1.07 s ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
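A plausible reason for the gap is that filter calls the Python lambda once per candidate group, while the first solution works on whole columns. As a further option that is not part of the original answer and was not timed here, the same AND condition can be written with a grouped transform('any'). This is a sketch on made-up data; `toy`, `d1`, `d2`, `has_d1`, and `has_d2` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    'candidate': ['Ann', 'Ann', 'Bob', 'Cid', 'Cid'],
    'forecast_date': pd.to_datetime([
        '2018-08-11', '2018-11-06',   # Ann: both dates
        '2018-08-11',                 # Bob: first date only
        '2018-11-06', '2018-09-01',   # Cid: second date only
    ]),
})

d1 = pd.Timestamp('2018-08-11')
d2 = pd.Timestamp('2018-11-06')

# For each row: does this row's candidate have at least one row on d1 / d2?
# transform('any') broadcasts the per-group result back to every row.
has_d1 = toy['forecast_date'].eq(d1).groupby(toy['candidate']).transform('any')
has_d2 = toy['forecast_date'].eq(d2).groupby(toy['candidate']).transform('any')

# Both masks are row-aligned boolean Series, so element-wise & is appropriate.
result = toy[has_d1 & has_d2]
print(result)
```

Here `&` is the right operator because both operands are row-aligned Series, unlike the scalar results of any inside filter. Whether this variant is faster on the real data would need to be measured.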

Source: the corresponding question on Stack Overflow: https://stackoverflow.com/questions/61152417/
