python - Optimizing the search for matching substrings between two lists with regex in Python

Tags: python regex string list match

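This is my approach for searching a list of "phrases" for substrings taken from a list of "words". For each element of the phrase list, it returns the words that were found in it.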

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)

# (desired and actual) output
[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

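Since the "words" list (list_to_search) has about 1,700 words and the "phrases" list (list_to_be_searched) has about 26,561 entries, the code takes more than 30 minutes to finish. I don't think my code above was written with Pythonic style or efficient data structures in mind. :(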

Could anyone suggest ways to optimize it or speed it up?

Thanks!

Actually, my example above was wrong. What if list_to_search contains elements made up of several words?

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['hello my'],
 ['name', 'is'],
 ['name', 'is'],
 [],
 ['name', 'is', 'is your name', 'your'],
 ['name', 'is']]

Timing, first method:

%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word,phrase) is True:
            searched.append(word)
    to_be_appended.append(searched)
#43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Second method (nested list comprehension and re.findall):

%%timeit
[[j for j in list_to_search if j in re.findall(r"\b{}\b".format(j), i)] for i in list_to_be_searched]
#40.3 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The timing did improve a little, but is there a faster way? Or is the task inherently slow, given what it has to do?

Best answer

You can use a nested list comprehension:

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]

[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]
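
Note that `j in i.split()` only finds single-word entries, so a multi-word entry such as 'is your name' from the second example in the question would never match. One possible way to cover that case without running a regex per pair is to build a set of word n-grams for each phrase and test membership against it. The following is only a rough sketch, assuming the phrases are plain whitespace-separated words with no punctuation and that matching should be case-insensitive as in the question's `re.IGNORECASE` call. The helper names `build_ngrams` and `find_terms` are made up for illustration.

```python
def build_ngrams(tokens, max_n):
    """Return the set of all space-joined runs of 1..max_n consecutive tokens."""
    grams = set()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[start:start + n]))
    return grams


def find_terms(terms, phrases):
    """For each phrase, list the terms that occur in it as whole-word runs."""
    # Normalise each term once: lower-case it and collapse whitespace.
    normalised = [(term, " ".join(term.lower().split())) for term in terms]
    # The longest term (in words) bounds the n-gram size we need to build.
    max_n = max((len(norm.split()) for _, norm in normalised), default=0)
    results = []
    for phrase in phrases:
        # Assumption: splitting on whitespace is an adequate tokenizer here.
        grams = build_ngrams(phrase.lower().split(), max_n)
        # Set membership is a hash lookup, so each term costs O(1) on average.
        results.append([term for term, norm in normalised if norm in grams])
    return results


list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

matches = find_terms(list_to_search, list_to_be_searched)
print(matches)
```

For the sample data this gives the same result as the desired output in the question's second example. If the real phrases contain punctuation, the `\b` behaviour of the original regex differs from whitespace splitting, so the tokenizing step would need adjusting.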

This post about optimizing the search for matching substrings between two lists with regex in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/55220207/
