python - How to get the word case from the text when pattern matching in Python

Tags: python regex pandas case-insensitive

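I have a DataFrame with two columns, Stg and Txt. The task is to check every word in the Stg column against each Txt row and output the matched words to a new column, keeping the casing the words have in Txt.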

Example Code:

import re

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new, columns=['Stg','Txt'])

my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    # Yield each listed word the first time it appears in a_string.
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                # Every word has been found, so end the generator.
                # (Raising StopIteration here is a RuntimeError since Python 3.7.)
                return

# Collect the matches for each Txt row into a new column.
df['new'] = [list(words_in_string(my_list, text)) for text in df['Txt']]

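The code above returns the casing from the Stg column. Is there a way to get the casing from Txt instead? Also, I want to match whole words rather than substrings: for the text "two-way", the current code returns the word way.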
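Both symptoms can be reproduced with the standard library alone. This is a minimal sketch using strings from the Txt column; the variable names are only for illustration.

```python
import re

# \b sits between any word character and any non-word character,
# so the hyphen in "two-way" counts as a boundary right before "way".
hyphen_hits = re.findall(r"\bway\b", "two-way allowed")

# The pattern is case-sensitive by default, so "phone" never matches "Phone".
sensitive_hits = re.findall(r"\bphone\b", "New Phone feature that allowed")

# With re.IGNORECASE the match is returned exactly as it is written in the text.
insensitive_hits = re.findall(
    r"\bphone\b", "New Phone feature that allowed", flags=re.IGNORECASE
)

print(hyphen_hits, sensitive_hits, insensitive_hits)
```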

Current Output:

    Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]


Expected Output:

    Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

Best answer

You should use Series.str.findall with a negative lookbehind:

import pandas as pd
import re

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = pd.DataFrame(new,columns= ['Stg','Txt'])

# Raw f-string so that \w and \b reach the regex engine unchanged.
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])

df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)

print(df)

Output:

          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
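
To see what each part of the pattern contributes, the same idea can be tried without pandas. The sketch below is only an illustration: `find_terms` and `BLOCKED_BEFORE` are hypothetical names, and the `re.escape` call is an addition not present in the answer, included on the assumption that a term might contain regex metacharacters.

```python
import re

# Characters that must not appear immediately before a term. Same set as in
# the answer's lookbehind, with the hyphen escaped for clarity.
BLOCKED_BEFORE = r"A-Za-z\-;:,/|"

def find_terms(terms, text):
    """Return the terms found in text, in the casing used by the text."""
    alternatives = "|".join(
        # \w*      lets a prefix ending in a digit or underscore join the match
        # (?<!...) rejects a term preceded by a letter or a listed separator
        # \b       requires a word boundary right after the term
        rf"\w*(?<![{BLOCKED_BEFORE}]){re.escape(term)}\b"
        for term in terms
    )
    # IGNORECASE makes matching case-insensitive; findall still returns
    # the matched substring of text, so the text's casing is preserved.
    return re.findall(alternatives, text, flags=re.IGNORECASE)

terms = ["way", "Early", "phone", "allowed", "type", "brand name"]
for text in ["An early term", "two-way allowed", "New Phone feature that allowed"]:
    print(text, "->", find_terms(terms, text))
```

The hyphen in the lookbehind is what keeps "way" from matching inside "two-way"; any other separator that should block a match would have to be added to that character class.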

For "python - How to get the word case from the text when pattern matching in Python", see the original question on Stack Overflow: https://stackoverflow.com/questions/58245174/
