我编写了一个函数,它用具有给定属性的 HTML 元素包围搜索词。这个想法是将生成的包围字符串稍后写入日志文件,并突出显示搜索词。
def inject_html(needle, haystack, html_element="span", html_attrs={"class":"matched"}):
# Find all occurrences of a given string in some text
# Surround the occurrences with a HTML element and given HTML attributes
new_str = haystack
start_index = 0
while True:
try:
# Get the bounds
start = new_str.lower().index(needle.lower(), start_index)
end = start + len(needle)
# Needle is present, compose the HTML to inject
html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
html_close = "</" + html_element + ">"
new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
start_index = end + len(html_close) + len(html_open)
except ValueError as ex:
# String doesn't occur in text after index, break loop
break
return new_str
我想打开它以接受一系列针,在大海捞针中用 HTML 定位和包围它们。我可以通过用另一个循环围绕代码来轻松地做到这一点,该循环遍历针、定位和周围的搜索词实例。问题是,这并不能防止意外包围先前注入(inject)的 HTML 代码,例如
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
# Find all occurrences of a given string in some text
# Surround the occurrences with a HTML element and given HTML attributes
new_str = haystack
for needle in needles:
start_index = 0
while True:
try:
# Get the bounds
start = new_str.lower().index(needle.lower(), start_index)
end = start + len(needle)
# Needle is present, compose the HTML to inject
html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
html_close = "</" + html_element + ">"
new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
start_index = end + len(html_close) + len(html_open)
except ValueError as ex:
# String doesn't occur in text after index, break loop
break
return new_str
search_strings = ["foo", "pan", "test"]
haystack = "Foobar"
print(inject_html(search_strings,haystack))
<s<span class="matched">pan</span> class="matched">Foo</span>bar
在第二次迭代中,代码搜索并包围上一次迭代中插入的“span”中的“pan”文本。
你会如何建议我更改我的原始功能以查找针列表,而不会有将 HTML 注入(inject)不需要的位置(例如在现有标签内)的风险。
---更新---
我通过维护“免疫”范围列表(已经被 HTML 包围因此不需要再次检查的范围)来解决这个问题。
def inject_html(needles, haystack, html_element="span", html_attrs={"class":"matched"}):
# Find all occurrences of a given string in some text
# Surround the occurrences with a HTML element and given HTML attributes
immune = []
new_str = haystack
for needle in needles:
next_index = 0
while True:
try:
# Get the bounds
start = new_str.lower().index(needle.lower(), next_index)
end = start + len(needle)
if not any([(x[0] > start and x[0] < end) or (x[1] > start and x[1] < end) for x in immune]):
# Needle is present, compose the HTML to inject
html_open = "<" + html_element + " " + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
html_close = "</" + html_element + ">"
new_str = new_str[0:start] + html_open + new_str[start:end] + html_close + new_str[end:len(new_str)]
next_index = end + len(html_close) + len(html_open)
# Add the highlighted range (and HTML code) to the list of immune ranges
immune.append([start, next_index])
except ValueError as ex:
# String doesn't occur in text after index, break loop
break
return new_str
虽然它不是特别 Pythonic,但我很想看看是否有人能想出更干净的东西。
最佳答案
我会使用这样的东西:
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class":"matched"}):
new_text_body = []
html_start_tag = "<" + html_element_name + " ".join(["%s=\"%s\""%(k,html_attrs[k]) for k in html_attrs]) + ">"
html_end_tag = "</" + html_element_name + ">"
text_body_lines = text_body.split("\n")
for line in text_body_lines:
for p in phrases:
if line.lower() == p.lower():
line = html_start_tag + p + html_end_tag
break
new_text_body.append(line)
return "\n".join(new_text_body)
它逐行检查并替换每一行(如果该行完全匹配(不区分大小写))。
第二轮:
根据匹配需要 (1) 不区分大小写和 (2) 在每一行匹配多个单词/短语的要求,我会使用:
import re
def inject_html(phrases, text_body, html_element_name="span", html_attrs={"class": "matched"}):
html_start_tag = "<" + html_element_name + " " + " ".join(["%s=\"%s\"" % (k, html_attrs[k]) for k in html_attrs]) + ">"
html_end_tag = "</" + html_element_name + ">"
for p in phrases:
text_body = re.sub(r"({})".format(p), r"{}\1{}".format(html_start_tag, html_end_tag), text_body, flags=re.IGNORECASE)
return text_body
对于每个提供的短语 p
,这使用不区分大小写的 re.sub()
替换来替换提供的文本中该短语的所有实例。 (p)
通过正则表达式组匹配短语。 \1
是匹配找到的短语的回填运算符,将其包含在 HTML 标记中。
text = """
Somewhat more than forty years ago, Mr Baillie Fraser published a
lively and instructive volume under the title _A Winter’s Journey
(Tatar) from Constantinople to Teheran. Political complications
had arisen between Russia and Turkey - an old story, of which we are
witnessing a new version at the present time. The English government
deemed it urgently necessary to send out instructions to our
representatives at Constantinople and Teheran.
"""
new = inject_html(["TEHERAN", "Constantinople"], text)
print(new)
> Somewhat more than forty years ago, Mr Baillie Fraser published a lively and instructive volume under the title _A Winter’s Journey (Tatar) from <span class="matched">Constantinople</span> to <span class="matched">Teheran</span>. Political complications had arisen between Russia and Turkey - an old story, of which we are witnessing a new version at the present time. The English government deemed it urgently necessary to send out instructions to our representatives at <span class="matched">Constantinople</span> and <span class="matched">Teheran</span>.
关于Python - 使用 HTML 包围给定列表中存在于另一个字符串中的字符串实例,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44634826/