Python: Grabbing text before and after a keyword

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string,  i.e., a page of a book"

for k in keywords:
    if k in TEXT:
        # cut = portion of TEXT starting 'before' chars before the occurrence of 'k' and ending 'after' chars after it
        # finalcut = 'cut' with its first and last words trimmed, so that it never starts or ends in the middle of a word

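Guys, can you help me code the cut and finalcut string variables in the example above?

Considering that I will be dealing with large texts, many pages, and probably more than 20 keywords to search for, what would be the most efficient solution?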

Best answer

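You can use re.finditer to find all matches in the string. Each match object has a start() method that gives the position of the match within the string. You also don't need to check whether the keyword is in the string first, because finditer simply returns an empty iterator when there is no match: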

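import re

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string,  i.e., a page of a book"

for k in keywords:
    # re.escape makes sure the keyword is matched literally
    for match in re.finditer(re.escape(k), TEXT):
        position = match.start()
        # max() is needed because the start index must not be negative
        cut = TEXT[max(position - before, 0):position + after]
        # re.DOTALL lets "." match newlines too
        trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut, re.DOTALL)
        # fall back to the raw cut if there is nothing to trim
        finalcut = trimmed_match.group(1) if trimmed_match else cut

The regular expression trims everything up to and including the first sequence of non-word characters, and everything from the last non-word character onward. I pass re.DOTALL in case there are line breaks in your text: without it, "." stops at the first newline (re.MULTILINE only changes how ^ and $ behave, so it would not help here).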
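
On the efficiency part of the question: one option, not taken from the answer above, is to join all keywords into a single alternation pattern so the text is scanned once rather than once per keyword. The following is only a minimal sketch under that assumption. The function name keyword_snippets and the sample sentence are made up for illustration, and whether this is actually faster for a given text should be measured.

```python
import re


def keyword_snippets(text, keywords, before=50, after=100):
    """Yield (keyword, snippet) pairs for each keyword occurrence in text.

    Hypothetical helper, not part of the original answer.
    """
    if not keywords:
        return
    # Longest keywords first, so a longer keyword wins over one it contains.
    ordered = sorted(keywords, key=len, reverse=True)
    # Single alternation pattern: the text is scanned once for all keywords.
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    # Drops the leading and trailing (possibly partial) words of a window.
    trim = re.compile(r"\w*?\W+(.*)\W+\w*", re.DOTALL)

    for match in pattern.finditer(text):
        start = max(match.start() - before, 0)  # never a negative index
        cut = text[start:match.start() + after]
        trimmed = trim.match(cut)
        # Fall back to the raw window when there is nothing to trim.
        snippet = trimmed.group(1) if trimmed else cut
        yield match.group(0), snippet


sample = "I ate a ripe banana and then an apple before lunch today"
for keyword, snippet in keyword_snippets(sample, ("banana", "apple"), 10, 15):
    print(keyword, "->", snippet)
```

Note that a single pattern reports non-overlapping matches only, so if one keyword occurs inside another (for example "apple" inside "pineapple", with both in the list), that stretch of text is reported once, under the longer keyword.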

This question, Python: Grabbing text before and after a keyword, comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/25277305/
