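I am writing a regular expression to capture the data between a pair of double quotes ("").
The only problem I have is that the closing " is being captured as well.
Regex: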
import re
line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
capture_regex = re.compile(r'(?<=HREF=").*?"', re.IGNORECASE)
m = capture_regex.search(line)
print(m.group())
This prints https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html"
with the trailing quote included. How can I write the regex so that it does not include the final quote?
That answered my question. I will add that my regex also uses what is called a non-greedy quantifier:
capture_regex = re.compile(r'(?<=HREF=").*?(?=")',re.IGNORECASE)
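Adding ? after * makes the quantifier non-greedy, so the match stops at the first " instead of the last one. The lookahead (?=") then requires that quote to follow without including it in the match.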
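To make the difference concrete, here is a minimal sketch comparing the greedy, non-greedy, and non-greedy-plus-lookahead forms. The shortened sample string and the variable names are assumptions made for illustration, not part of the original question.

```python
import re

# Hypothetical sample input, shortened for readability.
sample = '<A HREF="https://example.com/page.html" ADD_DATE="1567455957">Example</A>'

# Greedy: .* runs to the last quote on the line and includes it.
greedy = re.search(r'(?<=HREF=").*"', sample, re.IGNORECASE).group()

# Non-greedy: .*? stops at the first quote, but the quote is still consumed.
lazy = re.search(r'(?<=HREF=").*?"', sample, re.IGNORECASE).group()

# Non-greedy plus lookahead: the quote is checked but is not part of the match.
lookahead = re.search(r'(?<=HREF=").*?(?=")', sample, re.IGNORECASE).group()

print(greedy)
print(lazy)
print(lookahead)
```

Only the third form returns the bare URL; the second is the pattern from the question and still ends with the quote.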
Alternatively, find_all from bs4 might work well here:
from bs4 import BeautifulSoup
line = '<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>'
soup = BeautifulSoup(line, 'html.parser')
# href=True keeps only <a> tags that actually have an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])
Output
https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html
If not, an expression similar to
(?i)href="\s*([^\s"]*?)\s*"
with re.findall might work here:
import re
expression = r'(?i)href="\s*([^\s"]*?)\s*"'
string = """
<DT><A HREF="https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html" ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
<DT><A HREF=" https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html " ADD_DATE="1567455957">Clickjacking Defense · OWASP Cheat Sheet Series</A>
"""
print(re.findall(expression, string))
Output
['https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html', 'https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html']
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
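As a rough offline alternative to that explanation, the same expression can be spelled out piece by piece with re.VERBOSE. This is only a sketch: the name href_pattern and the sample input are assumptions, and the inline (?i) is replaced here by the equivalent re.IGNORECASE flag.

```python
import re

href_pattern = re.compile(
    r'''
    href="      # attribute name and opening quote
    \s*         # optional leading whitespace inside the quotes
    ([^\s"]*?)  # group 1: the URL, containing no whitespace or quotes
    \s*         # optional trailing whitespace
    "           # closing quote, matched but outside the group
    ''',
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical input with padding around the URL.
html = '<A HREF=" https://example.com/page.html " ADD_DATE="1">Example</A>'
urls = href_pattern.findall(html)
print(urls)
```

Because the pattern has exactly one capturing group, findall returns the contents of that group, so the quotes and the surrounding whitespace are left out of the results.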