python - How to get the word case from the text when pattern matching in Python

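I have a dataframe with two columns, Stg and Txt. The task is to check every word in the Stg column against each Txt row and output the matched words into a new column, keeping the word case as it appears in Txt.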

Example Code:

import re

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new, columns=['Stg','Txt'])

my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    # Yield each word from word_list the first time it is found in a_string.
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                # Every word has been found. Use a plain return here:
                # "raise StopIteration" inside a generator becomes a
                # RuntimeError from Python 3.7 on (PEP 479).
                return

# Collect the matched words for every row of Txt.
df['new'] = [list(words_in_string(my_list, text)) for text in df['Txt']]

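The code above returns the case from the Stg column. Is there a way to get the case from Txt instead? Also, I want to match whole strings rather than substrings: for the text "two-way", the current code returns the word way.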
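Both symptoms can be reproduced in isolation with the standard re module. The snippet below is only a minimal sketch of the behaviour, not part of the original code; the variable names are made up for illustration.

```python
import re

# A hyphen is a non-word character, so \b matches between "-" and "w"
# and "way" is found inside "two-way".
hyphen_hits = re.findall(r"\b(way|allowed)\b", "two-way allowed")
print(hyphen_hits)   # ['way', 'allowed']

# Matching is case-sensitive by default, so "Early" does not match "early".
case_hits = re.findall(r"\b(Early)\b", "An early term")
print(case_hits)     # []

# With re.IGNORECASE the match succeeds, and the returned text is the
# substring of the searched string, so it keeps the casing used there.
ci_hits = re.findall(r"\b(Early)\b", "An early term", flags=re.IGNORECASE)
print(ci_hits)       # ['early']
```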

Current Output:

    Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]


Expected Output:

    Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

Best answer

You should use Series.str.findall with a negative lookbehind:

import pandas as pd
import re

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = pd.DataFrame(new,columns= ['Stg','Txt'])

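# The negative lookbehind rejects a term that directly follows a letter or
# one of - ; : , / | (so "way" in "two-way" is skipped).
# A raw f-string avoids invalid "\w" escapes; re.escape guards against
# regex metacharacters in the Stg values.
pattern = "|".join(rf"\w*(?<![A-Za-z\-;:,/|]){re.escape(i)}\b" for i in new["Stg"])

# re.IGNORECASE matches regardless of case; findall returns the text as it appears in Txt.
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)

print(df)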

Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
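
As a variation on the same idea, the lookbehind can be paired with a matching lookahead so that a term is rejected when it touches a word character or a hyphen on either side. This is a sketch under assumptions, not the answer's exact pattern: build_pattern is a hypothetical helper name, and treating only word characters and "-" as "part of the same word" is my own choice. It uses plain re on a list of strings; the resulting pattern string could be passed to Series.str.findall in the same way as above.

```python
import re

terms = ['way', 'Early', 'phone', 'allowed', 'type', 'brand name']
texts = ['An early term', 'two-way allowed', 'New Phone feature that allowed',
         'amazing universe', 'new day', 'the brand name is stage']

def build_pattern(terms):
    """Hypothetical helper: one alternation guarded by lookarounds."""
    # Escape regex metacharacters; try longer terms first so a multi-word
    # term is preferred over a shorter term that is its prefix.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    # (?<![\w-]) and (?![\w-]): the match must not be adjacent to a word
    # character or a hyphen on either side (assumption, see lead-in).
    return r"(?<![\w-])(?:" + "|".join(escaped) + r")(?![\w-])"

pattern = build_pattern(terms)

# findall returns the matched substrings of each text, so the casing
# comes from the text rather than from the term list.
matches = [re.findall(pattern, t, flags=re.IGNORECASE) for t in texts]

for text, found in zip(texts, matches):
    print(f"{text!r}: {found}")
```

On the sample data this gives the same lists as the expected output in the question. Unlike the pattern above, it has no leading \w*, so a term glued to digits (for example "2way") would not match.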

For "python - How to get the word case from the text when pattern matching in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58245174/
