python - Filter duplicated observations, keeping only groups where at least one record has a non-missing cusip identifier

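I have a dataset containing company information. Specifically, it has two columns called CUSIP and RCID. I first want to pull out all observations with a duplicated RCID. So if my dataset looks like the one below, I want records 3, 4, 5, 6, 7 and 8. I was able to do this with: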

fin_rcid_dup = df['rcid'].duplicated(keep=False)
filtered_df = df[fin_rcid_dup]

Index  RCID    CUSIP
1      478923  .
2      346422  .
3      362736  .
4      362736  4637468
5      362736  .
6      673253  .
7      673253  .
8      362736  .

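I would then like to keep only the groups (observations with the same rcid belong to the same group) in which at least one record has a non-missing cusip. That is, I want records 3, 4, 5 and 8, but not 6 and 7, because every record with rcid == 673253 is missing its cusip. Do you know how I can do this?

Thanks!

I tried: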

has_duplicates = df['rcid'].duplicated(keep=False)

cusip_not_missing = df['cusip'].notna()

filtered_df = df[has_duplicates & cusip_not_missing]

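But this way I only get record 4, because the other records with rcid == 362736 are missing their cusip.

Best answer

Use notna together with groupby.transform, so that the "has at least one non-missing cusip" check is evaluated once per rcid group and then broadcast back to every row of that group: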

has_duplicates = df['rcid'].duplicated(keep=False)

cusip_not_missing = df['cusip'].notna().groupby(df['rcid']).transform('any')

filtered_df = df[has_duplicates & cusip_not_missing]
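
For anyone who wants to try this end to end, here is a minimal sketch on the sample data from the question. It assumes the "." entries are real missing values (NaN) once loaded and that cusip is stored as strings; the index values 1 to 8 and the name group_has_cusip are only there for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Assumption: "." means missing, so it is NaN here.
df = pd.DataFrame(
    {
        "rcid": [478923, 346422, 362736, 362736, 362736, 673253, 673253, 362736],
        "cusip": [np.nan, np.nan, np.nan, "4637468", np.nan, np.nan, np.nan, np.nan],
    },
    index=range(1, 9),
)

# True for every row whose rcid occurs more than once.
has_duplicates = df["rcid"].duplicated(keep=False)

# Row-level "cusip present" flag, reduced with any() per rcid group and
# broadcast back to each row of that group.
group_has_cusip = df["cusip"].notna().groupby(df["rcid"]).transform("any")

filtered_df = df[has_duplicates & group_has_cusip]
print(filtered_df.index.tolist())  # [3, 4, 5, 8]
```

If the column actually holds the literal string "." instead of NaN, notna() would report every row as present, so the placeholder would probably need to be converted first, for example with df['cusip'].replace('.', np.nan).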

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/77129775/
