keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string, i.e., a page of a book"
for k in keywords:
    if k in TEXT:
        # cut = portion of TEXT starting 'before' chars before the occurrence of 'k' and ending 'after' chars after it
        # finalcut = 'cut' with its first and last WORDS trimmed, so that it does not start or end in the middle of a word
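Guys, can you help me write the cut and finalcut string variables from the example above?
Considering that I will be processing large texts, many pages, and possibly more than 20 keywords to search for, what would be the most efficient solution?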
Best answer
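You can use re.finditer to find all matches in the string. Each match object has a start() method that gives the position of the match in the string. You also don't need to check whether the keyword is in the string, because finditer returns an empty iterator when there are no matches: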
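import re

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string, i.e., a page of a book"
for k in keywords:
    # re.escape treats the keyword as literal text rather than as a pattern
    for match in re.finditer(re.escape(k), TEXT):
        position = match.start()
        # max() is needed because the start index must not be negative
        cut = TEXT[max(position - before, 0):position + after]
        # re.DOTALL (rather than re.MULTILINE) lets "." match across line breaks
        trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut, re.DOTALL)
        # fall back to the untrimmed cut if it contains fewer than two word breaks
        finalcut = trimmed_match.group(1) if trimmed_match else cut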
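The regular expression trims everything up to and including the first run of non-word characters, and everything from the last run of non-word characters onward. Note that re.MULTILINE only changes where ^ and $ match; if your text contains line breaks, the flag that lets "." match them is re.DOTALL.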
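On the efficiency part of the question: one possible approach, not taken from the answer above, is to combine all keywords into a single compiled alternation so the text is scanned once instead of once per keyword. The sketch below assumes the keywords are non-empty plain strings; the function name keyword_snippets is hypothetical, and it trims partial words by moving the window edges to the nearest whitespace rather than with a second regex. It also counts `after` from the end of the keyword, as the question describes, rather than from its start.

```python
import re


def keyword_snippets(text, keywords, before=50, after=100):
    """Yield (keyword, snippet) for each keyword occurrence in text.

    Hypothetical helper: assumes every keyword is a non-empty string.
    """
    # Longest keywords first, so "apple pie" is preferred over "apple".
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" literal.
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    for match in pattern.finditer(text):
        start = max(match.start() - before, 0)
        end = min(match.end() + after, len(text))
        # Move the left edge right until it sits just after whitespace
        # (or reaches the keyword), so no partial leading word remains.
        while 0 < start < match.start() and not text[start - 1].isspace():
            start += 1
        # Move the right edge left until it sits just before whitespace
        # (or reaches the keyword), so no partial trailing word remains.
        while match.end() < end < len(text) and not text[end].isspace():
            end -= 1
        yield match.group(), text[start:end].strip()
```

Because it is a generator, snippets can be consumed page by page without building a full list in memory. Whether a single alternation is actually faster than separate searches depends on the keywords and the text, so it is worth timing both on real data.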
Regarding Python: grabbing text before and after a keyword, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25277305/