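I want to create categories based on specific keyword strings found in a column, instead of every row being assigned the category "other".
For example: if "health" is found in the column, label that keyword row "HEALTH"; if "therapy" is found, label it "THERAPIST".
- Create the "category" column in code
- Assign the category based on conditions
I was able to do this in Excel by building a lookup table and using INDEX/MATCH, and I would like to switch to Python so I can apply it to large datasets.
Sample data is below: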
Best answer
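You can build a single regular expression from all the keywords. Then, depending on whether you want only the first match or all matches, use str.extract, or str.extractall followed by an aggregation, respectively.
I added the keyword "private" as an example so you can see the difference in row 3: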
import re

# df is assumed to be the existing DataFrame with a 'keyword' column
words = ['health', 'therapist', 'sales', 'private']
regex = '|'.join(map(re.escape, words))
# 'health|therapist|sales|private'

# option 1: get first match
df['category_first'] = (df['keyword']
    .str.extract(f'(?i)({regex})', expand=False)
    .fillna('other')
)

# option 2: get all matches
df['category_all'] = (df['keyword']
    .str.extractall(f'(?i)({regex})')
    [0].groupby(level=0).agg(','.join)
    .reindex(df.index, fill_value='other')
)

print(df)
Output:
    keyword                                             category   category_first  category_all
0   HR Consultancy UK-d-uk-159_bing                     other      other           other
1   it support COMPANY LONDON-D-UK-G1161_bing           other      other           other
2   global sales training platform openings             sales      sales           sales
3   tele private practice therapist                     therapist  private         private,therapist
4   asset grant management system                       other      other           other
5   digital team project management solution openings   other      other           other
6   global training platform openings                   other      other           other
7   tele practice therapist                             therapist  therapist       therapist
8   global sales training platform openings             sales      sales           sales
9   tele health practice                                health     health          health
10  asset grant management                              other      other           other
11  digital team project management solution            other      other           other
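
The question also asks for custom labels (for example "therapy" becoming "THERAPIST"), which the snippet above does not cover because it returns the matched keyword itself. Here is a minimal sketch of one way to do that, assuming a keyword-to-label dictionary. The `labels` mapping, the "OTHER" fallback, and the four sample rows are illustrative assumptions, not part of the original answer:

```python
import re
import pandas as pd

# A few keyword rows copied from the output table above
df = pd.DataFrame({
    "keyword": [
        "HR Consultancy UK-d-uk-159_bing",
        "global sales training platform openings",
        "tele private practice therapist",
        "tele health practice",
    ]
})

# Hypothetical mapping from lowercase keyword to the category label you want
labels = {
    "health": "HEALTH",
    "therapy": "THERAPIST",
    "therapist": "THERAPIST",
    "sales": "SALES",
}

# Put longer keywords first so a longer alternative is tried before a shorter one
regex = "|".join(map(re.escape, sorted(labels, key=len, reverse=True)))

# First case-insensitive match per row (NaN when nothing matches)
matched = df["keyword"].str.extract(f"(?i)({regex})", expand=False)

# Normalize case, translate keyword -> label, and fall back to "OTHER"
df["category"] = matched.str.lower().map(labels).fillna("OTHER")

print(df)
```

Lowercasing the match before the lookup keeps the dictionary small, since the `(?i)` flag means the extracted text can appear in any case.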
Regarding "python - find partial text in a column and, if found, create a new column with the assigned text value instead of True or False", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73019833/