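The task is to use pandas to drop any invalid names and any candidate names that do not have two forecasts. In the DataFrame, some candidate names appear on both forecast dates, while others appear on only one. So I want to drop the candidates that have only one forecast date.
I am trying to use groupby and filter to remove the candidate names that do not satisfy both conditions: ('forecast_date' == '2018-08-11') AND ('forecast_date' == '2018-11-06')
Here is my code: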
election_sub=election_sub.dropna(subset=['candidate'])
grouped = election_sub.groupby('candidate')
grouped.filter(lambda x: (x['forecast_date']== '2018-08-11')&(x['forecast_date']=='2018-11-06'))
Best answer
Use:
import pandas as pd

# load the data into a DataFrame
url = 'https://raw.githubusercontent.com/fivethirtyeight/checking-our-work-data/master/us_house_elections.csv'
election_sub = pd.read_csv(url, parse_dates=['election_date','forecast_date'])
# drop rows with a missing candidate name
election_sub = election_sub.dropna(subset=['candidate'])
# keep rows whose forecast date is one OR the other target date
df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
# count the unique forecast dates per candidate
s = df.groupby('candidate')['forecast_date'].nunique()
# keep only candidates that have both dates (the AND condition)
cand = s.index[s.eq(2)].unique()
print (cand)
Index(['A. Donald McEachin', 'Aaron Andrus', 'Aaron Swisher',
'Abby Finkenauer', 'Abigail Spanberger', 'Adam B. Schiff',
'Adam Kinzinger', 'Adam Smith', 'Adrian Smith', 'Adriano Espaillat',
...
'William Lacy Clay', 'William Tanoos', 'William Timmons',
'Willie Billups', 'Xochitl Torres Small', 'Young Kim', 'Yvette Clarke',
'Yvette Herrell', 'Yvonne Hayes Hinson', 'Zoe Lofgren'],
dtype='object', name='candidate', length=960)
#filter original data by candidates
df = election_sub[election_sub['candidate'].isin(cand)]
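To make the steps above easier to follow, here is a minimal sketch on a hypothetical toy frame (the names and rows are invented for illustration and are not from the FiveThirtyEight file). It assumes explicit Timestamps for the two target dates, so the membership test does not rely on pandas converting strings to datetimes.

```python
import pandas as pd

# Hypothetical toy data: only "Ann" has forecasts on both target dates.
toy = pd.DataFrame({
    "candidate": ["Ann", "Ann", "Ann", "Bob", "Cid", "Cid", None],
    "forecast_date": pd.to_datetime([
        "2018-08-11", "2018-09-01", "2018-11-06",  # Ann: both targets plus one other date
        "2018-08-11",                              # Bob: first target only
        "2018-11-06", "2018-11-06",                # Cid: second target only, repeated
        "2018-08-11",                              # row with a missing name
    ]),
})

# Explicit Timestamps for the two target dates (an assumption of this sketch).
targets = pd.to_datetime(["2018-08-11", "2018-11-06"])

# 1) drop rows without a candidate name
toy = toy.dropna(subset=["candidate"])
# 2) keep rows on either target date
on_target = toy[toy["forecast_date"].isin(targets)]
# 3) count distinct target dates per candidate
n_dates = on_target.groupby("candidate")["forecast_date"].nunique()
# 4) candidates that have both target dates
keep = n_dates.index[n_dates.eq(2)]
# 5) select every row of those candidates from the cleaned data
result = toy[toy["candidate"].isin(keep)]
print(result)
```

Note that the last step selects from the cleaned frame rather than from the date-filtered one, so the kept candidate's row for the non-target date is retained as well.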
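Your approach also works if you test with .any() whether each condition is True at least once per group. Each test returns a scalar, so combine the two with and (rather than &) to express AND: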
grouped = election_sub.groupby('candidate')
df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
print(df)
year office state district special election_date forecast_date \
0 2018 House WY 1.0 False 2018-11-06 2018-11-06
1 2018 House WY 1.0 False 2018-11-06 2018-11-06
2 2018 House WY 1.0 False 2018-11-06 2018-11-06
3 2018 House WY 1.0 False 2018-11-06 2018-11-06
4 2018 House WY 1.0 False 2018-11-06 2018-11-06
... ... ... ... ... ... ... ...
282688 2018 House AK 1.0 False 2018-11-06 2018-08-01
282689 2018 House AK 1.0 False 2018-11-06 2018-08-01
282690 2018 House AK 1.0 False 2018-11-06 2018-08-01
282691 2018 House AK 1.0 False 2018-11-06 2018-08-01
282692 2018 House AK 1.0 False 2018-11-06 2018-08-01
forecast_type party candidate projected_voteshare \
0 lite D Greg Hunter 33.29836
1 lite R Liz Cheney 61.18835
2 deluxe D Greg Hunter 31.37998
3 deluxe R Liz Cheney 63.10673
4 classic D Greg Hunter 31.33293
... ... ... ... ...
282688 lite R Don Young 50.74973
282689 deluxe D Alyse S. Galvin 41.49152
282690 deluxe R Don Young 51.96705
282691 classic D Alyse S. Galvin 44.10701
282692 classic R Don Young 49.35155
actual_voteshare probwin probwin_outcome
0 NaN 0.00134 0
1 NaN 0.99866 1
2 NaN 0.00020 0
3 NaN 0.99980 1
4 NaN 0.00032 0
... ... ... ...
282688 NaN 0.76900 1
282689 NaN 0.12776 0
282690 NaN 0.87224 1
282691 NaN 0.28146 0
282692 NaN 0.71854 1
[282240 rows x 14 columns]
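The difference between the row-wise & and the group-wise .any() ... and ... .any() can be seen on a small example. This is only a sketch on hypothetical toy data; the helper name has_both is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: "Ann" has both dates, "Bob" has only the first.
toy = pd.DataFrame({
    "candidate": ["Ann", "Ann", "Bob"],
    "forecast_date": pd.to_datetime(["2018-08-11", "2018-11-06", "2018-08-11"]),
})
first = pd.Timestamp("2018-08-11")
last = pd.Timestamp("2018-11-06")

ann = toy[toy["candidate"] == "Ann"]

# Row-wise AND: no single row can equal both dates, so every entry is False.
row_wise = (ann["forecast_date"] == first) & (ann["forecast_date"] == last)
print(row_wise.tolist())  # [False, False]

# Group-wise test: does each date occur at least once within the group?
def has_both(group):
    dates = group["forecast_date"]
    return bool((dates == first).any() and (dates == last).any())

# filter() needs one True/False per group, which has_both provides.
kept = toy.groupby("candidate").filter(has_both)
print(kept["candidate"].unique())
```

The row-wise version yields a boolean Series per group rather than the single True/False that filter needs, and that Series is all False even for a candidate who does have both dates.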
Edit:
The two solutions differ in performance:
In [41]: %%timeit
...: df = election_sub[election_sub['forecast_date'].isin(['2018-08-11','2018-11-06'])].copy()
...: #get number of unique datetimes per groups
...: s = df.groupby('candidate')['forecast_date'].nunique()
...: #filter candidates only with both datetimes, like condition AND
...: cand = s.index[s.eq(2)].unique()
...:
...: #filter original data by candidates
...: df = election_sub[election_sub['candidate'].isin(cand)]
...:
61.3 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %%timeit
...: grouped = election_sub.groupby('candidate')
...: df = grouped.filter(lambda x: (x['forecast_date']== '2018-08-11').any() and (x['forecast_date']=='2018-11-06').any())
...:
1.07 s ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
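A likely reason for the gap is that filter calls the Python lambda once per candidate, while the isin / nunique route stays in vectorized pandas operations. The sketch below checks, on synthetic data, that both routes select the same rows, and times them with the standard timeit module. The data sizes and the function names via_nunique and via_filter are assumptions made for illustration, and the timings will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (sizes and names are made up).
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2018-08-01", "2018-08-11", "2018-11-06"])
n_rows, n_candidates = 5_000, 500
data = pd.DataFrame({
    "candidate": rng.integers(0, n_candidates, size=n_rows).astype(str),
    "forecast_date": dates[rng.integers(0, len(dates), size=n_rows)],
})
first, last = pd.Timestamp("2018-08-11"), pd.Timestamp("2018-11-06")

def via_nunique(df):
    # vectorized route: count distinct target dates per candidate
    sub = df[df["forecast_date"].isin([first, last])]
    counts = sub.groupby("candidate")["forecast_date"].nunique()
    return df[df["candidate"].isin(counts.index[counts.eq(2)])]

def via_filter(df):
    # per-group route: one Python call per candidate
    return df.groupby("candidate").filter(
        lambda g: bool((g["forecast_date"] == first).any()
                       and (g["forecast_date"] == last).any())
    )

a, b = via_nunique(data), via_filter(data)
# Compare after sorting by index so row order cannot affect the check.
print(a.sort_index().equals(b.sort_index()))

print("nunique:", timeit.timeit(lambda: via_nunique(data), number=3))
print("filter: ", timeit.timeit(lambda: via_filter(data), number=3))
```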
A similar question about using pandas groupby and filter to drop candidate names without two forecasts can be found on Stack Overflow: https://stackoverflow.com/questions/61152417/