python - Handling sparse categories in Pandas - replace everything not in the top categories with "Other"

Tags: python pandas dataframe counter data-cleaning

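When cleaning data I often run into the following problem: a column has a few common categories (for example, the top 10 movie genres) and many other sparse ones. The usual practice is to merge the sparse categories into a single "Other" category.

This is easy to do when there are only a few sparse categories: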

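# Merge the sparse bungalow classes into a single 'Bungalow' class.
# Assigning the result back avoids relying on inplace=True on a column accessed via attribute.
df['property_type'] = df['property_type'].replace(['Terraced bungalow', 'Detached bungalow', 'Semi-detached bungalow'], 'Bungalow')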

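But suppose I have a movie dataset in which most films were produced by 8 big studios, and I want to group everything else under an "Other" studio. In that case it makes sense to get the top 8 studios first: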

top_8_list = []
top_8 = df.studio.value_counts().head(8)
for key, value in top_8.items():  # Series.iteritems() was removed in pandas 2.0
    top_8_list.append(key)

top_8_list
['Universal Pictures',
 'Warner Bros.',
 'Paramount Pictures',
 'Twentieth Century Fox Film Corporation',
 'New Line Cinema',
 'Columbia Pictures Corporation',
 'Touchstone Pictures',
 'Columbia Pictures']

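And then do something like:

replace every studio that is not in the top 8 with "Other"

So the question is: does anyone know of an elegant pandas solution for this? It is a very common data-cleaning task.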

Best answer

You can use pd.DataFrame.loc with Boolean indexing:

df.loc[~df['studio'].isin(top_8_list), 'studio'] = 'Other'
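To see what the mask does, here is a minimal self-contained sketch on toy data. The DataFrame contents, the two made-up small studios, and the names `top_list` and `is_rare` are assumptions for illustration and are not part of the original question.

```python
import pandas as pd

# Toy data; "Tiny Indie Films" and "Garage Productions" are made-up names.
df = pd.DataFrame({
    "studio": ["Universal Pictures", "Warner Bros.", "Tiny Indie Films",
               "Universal Pictures", "Garage Productions", "Warner Bros."]
})
top_list = ["Universal Pictures", "Warner Bros."]

# True for rows whose studio is NOT one of the categories to keep.
is_rare = ~df["studio"].isin(top_list)

# Overwrite only those rows, modifying df in place.
df.loc[is_rare, "studio"] = "Other"

print(df["studio"].tolist())
# ['Universal Pictures', 'Warner Bros.', 'Other', 'Universal Pictures', 'Other', 'Warner Bros.']
```

Because the assignment goes through a single `.loc` call on `df` itself, it does not involve chained indexing.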

Note that you don't need a manual for loop to build the list of the top 8 studios:

top_8_list = df['studio'].value_counts().index[:8]
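Both steps can be wrapped into one reusable function. The sketch below is one possible way to do it, not something the answer prescribes: the helper name `lump_rare`, its parameters `n` and `other`, and the toy `films` frame are all hypothetical, and it swaps the `.loc` assignment for `Series.where`, which returns a new Series instead of modifying in place.

```python
import pandas as pd

def lump_rare(series: pd.Series, n: int = 8, other: str = "Other") -> pd.Series:
    """Hypothetical helper: keep the n most frequent values, relabel the rest."""
    # value_counts() sorts by frequency (descending), so the first n index
    # entries are the n most common categories.
    top = series.value_counts().index[:n]
    # where() keeps values where the condition is True and substitutes
    # `other` everywhere else.
    return series.where(series.isin(top), other)

# Toy data: "A" appears 3 times, "B" twice, "C" and "D" once each.
films = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})
films["studio"] = lump_rare(films["studio"], n=2)

print(films["studio"].tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
```

One thing to be aware of: `value_counts()` drops missing values by default, so with this sketch any NaN entries would also end up labelled "Other". If several categories tie at the cutoff, which of them makes the top n depends on how `value_counts()` orders the ties.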

A similar question about handling sparse categories in Pandas by replacing everything not in the top categories with "Other" can be found on Stack Overflow: https://stackoverflow.com/questions/52663432/
