I have this df:
import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
        'Tag': ['tag1,tag2', 'tag1', 'tag4,tag3,tag7', 'tag2,tag9', 'tag1,tag3', 'tag3,tag4', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
    }
)
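It looks like this:
        Name             Tag
0       Jane       tag1,tag2
1    Melissa            tag1
2       John  tag4,tag3,tag7
3       Matt       tag2,tag9
4  Abernethy       tag1,tag3
5      Annie       tag3,tag4
6      Brook       tag9,tag2
7      Brian       tag3,tag2
8     Carrie       tag1,tag5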
My goal is to create a third column, "Tag_after". As a simple SQL CASE statement it would be:
UPDATE table SET Tag_after =
    CASE
        WHEN Tag LIKE '%tag1%' THEN 'tag1'
        WHEN Tag LIKE '%tag2%' THEN 'tag2'
        WHEN Tag LIKE '%tag3%' THEN 'tag3'
        WHEN Tag LIKE '%tag4%' THEN 'tag4'
        WHEN Tag LIKE '%tag5%' THEN 'tag5'
        WHEN Tag LIKE '%tag_wrong1%' THEN 'tag_right1'
        ELSE Tag
    END
tag1 has a higher priority than tag2, and so on.
tag_wrong1 should be changed to tag_right1.
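The expected output is this:
        Name             Tag  Tag_after
0       Jane       tag1,tag2       tag1
1    Melissa            tag1       tag1
2       John  tag4,tag3,tag7       tag3
3       Matt       tag2,tag9       tag2
4  Abernethy       tag1,tag3       tag1
5      Annie       tag3,tag4       tag3
6      Brook       tag9,tag2       tag2
7      Brian       tag3,tag2       tag2
8     Carrie       tag1,tag5       tag1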
My (incorrect) attempt 1:
import pandas as pd

tag_1 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_wrong1', 'tag9']
tag_2 = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6', 'tag7', 'tag8', 'tag_right1', 'tag9']
df['Tag_after'] = ''

def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
    i = 0
    while i < len(tag_before):
        # Each pass writes to every row containing tag_before[i], including rows already set.
        df.loc[~df[col_tag].isnull() & df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
        i = i + 1

set_visitor_tag(df, tag_1, tag_2, 'Tag', 'Tag_after')
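The result does not follow the priority order I set.
I think the function matches and assigns each row several times. What I want is for a row to be processed once and then left alone.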
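The overwriting described above can be reproduced on a single row. This is only a minimal sketch: the one-row frame `demo` and the two-tag list are chosen for illustration and are not part of the original data.

```python
import pandas as pd

# One row that contains both tags; by priority, tag1 should win.
demo = pd.DataFrame({'Tag': ['tag1,tag2']})
demo['Tag_after'] = ''

for tag in ['tag1', 'tag2']:                  # priority order: tag1 first
    matches = demo['Tag'].str.contains(tag, regex=False)
    demo.loc[matches, 'Tag_after'] = tag      # every matching pass overwrites the row

print(demo['Tag_after'].tolist())             # ['tag2']: the last match wins, not the first
```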
My (incorrect) attempt 2:
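def set_visitor_tag(df, tag_before, tag_after, col_tag, add_col_tag):
    i = 0
    while i < len(tag_before):
        # `in` on a Series tests the index labels, not the values
        if tag_before[i] in df[col_tag]:
            df.loc[df[col_tag].str.contains(tag_before[i]), [add_col_tag]] = tag_after[i]
        else:
            # `continue` skips the increment below, so the loop never advances
            continue
        i = i + 1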
Thank you very much.
Best answer
One option is the case_when function from pyjanitor, which works like SQL's CASE WHEN:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor  # importing janitor registers case_when as a DataFrame method

df.case_when(
    df.Tag.str.contains('tag1'), 'tag1',  # condition, result
    df.Tag.str.contains('tag2'), 'tag2',
    df.Tag.str.contains('tag3'), 'tag3',
    df.Tag.str.contains('tag4'), 'tag4',
    df.Tag.str.contains('tag5'), 'tag5',
    df.Tag.str.contains('tag_wrong1'), 'tag_right1',
    df.Tag,  # default if none of the conditions evaluate to True
             # (newer pyjanitor releases may expect this as default=df.Tag)
    column_name='Tag_after')
Out[11]:
        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag2,tag9        tag2
4  Abernethy        tag1,tag3        tag1
5      Annie   tag3,tag4,tag5        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1
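If adding pyjanitor is not an option, the same first-match-wins behaviour can probably be had from a plain pandas loop that only writes to rows no earlier rule has claimed. The following is a sketch rather than a definitive implementation: `first_match_wins`, `rules` and `pending` are hypothetical names, the Tag values are copied from the output shown above, and `regex=False` assumes the tags should be matched as literal substrings.

```python
import pandas as pd

# Tag values copied from the output shown above.
df = pd.DataFrame({
    'Name': ['Jane', 'Melissa', 'John', 'Matt', 'Abernethy', 'Annie', 'Brook', 'Brian', 'Carrie'],
    'Tag': ['tag1,tag2', 'tag9,tag_wrong1', 'tag4,tag3,tag7', 'tag2,tag9', 'tag1,tag3',
            'tag3,tag4,tag5', 'tag9,tag2', 'tag3,tag2', 'tag1,tag5'],
})

# (substring, replacement) pairs, highest priority first (hypothetical helper data).
rules = [('tag1', 'tag1'), ('tag2', 'tag2'), ('tag3', 'tag3'),
         ('tag4', 'tag4'), ('tag5', 'tag5'), ('tag_wrong1', 'tag_right1')]

def first_match_wins(df, rules, col_tag, add_col_tag):
    result = df[col_tag].copy()                   # ELSE branch: keep the original value
    pending = pd.Series(True, index=df.index)     # rows that no rule has claimed yet
    for substring, replacement in rules:
        hit = pending & df[col_tag].str.contains(substring, regex=False, na=False)
        result[hit] = replacement                 # only unclaimed rows are written
        pending = pending & ~hit                  # claimed rows are skipped from now on
    return df.assign(**{add_col_tag: result})

out = first_match_wins(df, rules, 'Tag', 'Tag_after')
print(out['Tag_after'].tolist())
# ['tag1', 'tag_right1', 'tag3', 'tag2', 'tag1', 'tag3', 'tag2', 'tag2', 'tag1']
```

Tracking `pending` is what gives each row a single assignment, which is the behaviour the question asks for.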
Another option is numpy's select function:
import numpy as np

condlist = [df.Tag.str.contains('tag1'), df.Tag.str.contains('tag2'),
            df.Tag.str.contains('tag3'), df.Tag.str.contains('tag4'),
            df.Tag.str.contains('tag5'), df.Tag.str.contains('tag_wrong1')]
choicelist = ['tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag_right1']
df.assign(Tag_after=np.select(condlist, choicelist, df.Tag))
        Name              Tag   Tag_after
0       Jane        tag1,tag2        tag1
1    Melissa  tag9,tag_wrong1  tag_right1
2       John   tag4,tag3,tag7        tag3
3       Matt        tag2,tag9        tag2
4  Abernethy        tag1,tag3        tag1
5      Annie   tag3,tag4,tag5        tag3
6      Brook        tag9,tag2        tag2
7      Brian        tag3,tag2        tag2
8     Carrie        tag1,tag5        tag1
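When the list of tags grows, writing each condition by hand gets tedious. One way to avoid that, sketched here under a few assumptions, is to build `condlist` and `choicelist` from a single mapping kept in priority order. The `rules` dict and the 'Example' row are hypothetical additions for illustration; `regex=False` assumes literal substring matching, and `na=False` treats missing tags as non-matches.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # 'Example' is a made-up row, added only to show the fallback (ELSE) branch.
    'Name': ['Jane', 'Melissa', 'John', 'Brian', 'Example'],
    'Tag': ['tag1,tag2', 'tag9,tag_wrong1', 'tag4,tag3,tag7', 'tag3,tag2', 'tag9'],
})

# substring -> replacement; dicts keep insertion order, so this is also the priority order.
rules = {'tag1': 'tag1', 'tag2': 'tag2', 'tag3': 'tag3',
         'tag4': 'tag4', 'tag5': 'tag5', 'tag_wrong1': 'tag_right1'}

condlist = [df['Tag'].str.contains(s, regex=False, na=False) for s in rules]
choicelist = list(rules.values())

# np.select takes the choice of the first condition that is True in each row.
out = df.assign(Tag_after=np.select(condlist, choicelist, default=df['Tag']))
print(out['Tag_after'].tolist())
# ['tag1', 'tag_right1', 'tag3', 'tag2', 'tag9']
```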
Regarding "python - Pandas equivalent of a SQL CASE WHEN statement to create a new column", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69924628/