python - Replace a messy str with a clean str from another DataFrame

I have two DataFrames. If a value in df1['Fruits'] contains a string from df2['Fruits'], I want to clean it by replacing it with that string.

df1
Name    Fruits
--------------
Dina    Pineapple, [Y*]
Maria   PTC*, Apple
Johny   Durian, 1-6
Johny   5,6 Rambutan
Maria   Apple (Red), [Y] *
Dina    [Y] *, Peach88
Dina    Kiwi/Qiwi, PS*

df2
Fruits      tag
-------------
Apple       20
Pineapple   30
Rambutan    40
Durian      50
Apple (Red) 25
Peach88     55
Kiwi/Qiwi   25

I have tried

df1.loc[df1['Fruits'].contains(df2['Fruits']),'Fruits'] = df2['Fruits']

but it raises

'Series' object has no attribute 'contains'

So what I expect to get is

df1
Name    Fruits
--------------
Dina    Pineapple
Maria   Apple
Johny   Durian
Johny   Rambutan
Maria   Apple (Red)
Dina    Peach88
Dina    Kiwi/Qiwi

Best answer

Use pandas.Series.str.extract:

# Build a regex alternation, wrapped in one capture group, from df2['Fruits']
reg = '(%s)' % '|'.join(df2['Fruits'])
df1['Fruits'] = df1['Fruits'].str.extract(reg)

Output:

    Name     Fruits
0   Dina  Pineapple
1  Maria      Apple
2  Johny     Durian
3  Johny   Rambutan
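
The output above covers only the four plain names. Two details of the df2 shown in the question would trip the pattern as written: "Apple (Red)" contains parentheses, which the regex engine reads as a second capture group, and "Apple" is listed before "Apple (Red)", so the shorter name would match first. The sketch below is one possible way to handle both, assuming that escaping each name with re.escape and sorting longest-first is acceptable. The fillna fallback for rows without a known fruit is also an assumption and not part of the original answer.

```python
import re

import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "Name": ["Dina", "Maria", "Johny", "Johny", "Maria", "Dina", "Dina"],
    "Fruits": [
        "Pineapple, [Y*]",
        "PTC*, Apple",
        "Durian, 1-6",
        "5,6 Rambutan",
        "Apple (Red), [Y] *",
        "[Y] *, Peach88",
        "Kiwi/Qiwi, PS*",
    ],
})
df2 = pd.DataFrame({
    "Fruits": ["Apple", "Pineapple", "Rambutan", "Durian",
               "Apple (Red)", "Peach88", "Kiwi/Qiwi"],
    "tag": [20, 30, 40, 50, 25, 55, 25],
})

# Longest names first, so "Apple (Red)" is tried before "Apple".
names = sorted(df2["Fruits"], key=len, reverse=True)
# re.escape makes "(" and ")" match literally instead of opening a group.
pattern = "(%s)" % "|".join(re.escape(name) for name in names)

# expand=False returns a Series because the pattern has one capture group.
cleaned = df1["Fruits"].str.extract(pattern, expand=False)
# Assumption: rows with no known fruit keep their original text.
df1["Fruits"] = cleaned.fillna(df1["Fruits"])
print(df1)
```

Under these assumptions the Fruits column should come out as the seven clean names listed in the question's expected result. String methods on a Series live under the .str accessor, which is why df1['Fruits'].contains(...) in the question raises the AttributeError.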

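Explanation of '(%s)' % '|'.join(df2['Fruits']):

'|'.join(df2['Fruits']): joins the names with |, which is the "or" operator in a regular expression. With the four names Apple, Pineapple, Rambutan, and Durian, it returns Apple|Pineapple|Rambutan|Durian.
'(%s)' % ... : this is string formatting, equivalent to:
- str.format: '({})'.format('|'.join(df2['Fruits'])),
- or plain concatenation, which is more explicit but less Pythonic: '(' + '|'.join(df2['Fruits']) + ')'

All of these return (Apple|Pineapple|Rambutan|Durian), a capture group, which pd.Series.str.extract requires in order to know what to extract.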
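
As a quick check of the equivalence described above, the following sketch builds the pattern in each style from a plain list standing in for df2['Fruits'] and confirms that the results are identical. The f-string form is an extra variant that the answer does not mention, and the variable names are illustrative only.

```python
import re

# Stand-in for df2['Fruits'], limited to the four names used in the explanation.
fruits = ["Apple", "Pineapple", "Rambutan", "Durian"]

alternation = "|".join(fruits)

percent_style = "(%s)" % alternation       # the form used in the answer
format_style = "({})".format(alternation)  # str.format
concat_style = "(" + alternation + ")"     # plain concatenation
fstring_style = f"({alternation})"         # extra variant, not in the answer

# All four spellings produce the same pattern string.
assert percent_style == format_style == concat_style == fstring_style

# The surrounding parentheses give the pattern exactly one capture group.
group_count = re.compile(percent_style).groups
print(percent_style, group_count)
```

Which spelling to use is a matter of style; what str.extract needs is the single capture group that the outer parentheses provide.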

Regarding python - Replace a messy str with a clean str from another DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56268260/
