Consider the following example.
htmlist = ['<div class="portal" role="navigation" id="p-coll-print_export">',\
'<h3>Print/export</h3>',\
'<div class="body">',\
'<ul>',\
'<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page">Create a book</a></li>',\
'<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl">Download as PDF</a></li>',\
'<li id="t-print"><a href="/w/index.php?title=Main_Page&printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>',\
'</ul>',\
'</div>',\
'</div>',\
]
soup = __import__("bs4").BeautifulSoup("".join(htmlist), "html.parser")
for x in soup("a"):
    print(x)
    print(x.attrs)
    print(soup.a.get_text())
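I expected this short script to print each a tag (x), followed by a dictionary of x's attributes (names as keys, contents as values), and finally the link text.
The output is: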
<a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page">Create a book</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page'}
Create a book
<a href="/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl">Download as PDF</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl'}
Create a book
<a accesskey="p" href="/w/index.php?title=Main_Page&printable=yes" title="Printable version of this page [p]">Printable version</a>
{'href': '/w/index.php?title=Main_Page&printable=yes', 'title': 'Printable version of this page [p]', 'accesskey': ['p']}
Create a book
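The problems I see with this output are:
- The print(soup.a.get_text()) call always prints the text of the first tag.
- In the dictionary printed by print(x.attrs), the value of the "href" key is missing the &amp; (the & appears unescaped).
What am I missing here, and how can I get the desired output?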
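Best answer
You can use cgi.escape (removed in Python 3.8) or html.escape from the html module to encode the & character. The first issue comes from soup.a, which always refers to the first a tag in the document; call get_text() on the loop variable x instead.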
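import html
for x in soup("a"):
    print(x)
    # Re-encode & as &amp; in the href value only; quote=False leaves quotes untouched
    print({k: html.escape(v, False) if k == 'href' else v for k, v in x.attrs.items()})
    # Use the loop variable x, not soup.a, to get each link's own text
    print(x.get_text())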
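To illustrate why the & can look "missing", here is a minimal sketch that uses only the standard library's html.parser, the parser backend named in the question's BeautifulSoup call. It assumes the original page markup wrote the ampersand as &amp; (the sample href below is adapted from the question under that assumption), and HrefCollector is a hypothetical helper written for this example, not part of any library. The parser hands over attribute values with entity references already decoded, so re-encoding with html.escape is what brings &amp; back.

```python
import html
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entity references in the
        # values have already been decoded by the parser.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


# Assumed markup: the ampersand is written as &amp; in the HTML source.
markup = '<a href="/w/index.php?title=Main_Page&amp;printable=yes">Printable version</a>'

collector = HrefCollector()
collector.feed(markup)
collector.close()

decoded = collector.hrefs[0]                 # value as the parser reports it
encoded = html.escape(decoded, quote=False)  # & becomes &amp; again

print(decoded)  # /w/index.php?title=Main_Page&printable=yes
print(encoded)  # /w/index.php?title=Main_Page&amp;printable=yes
```

The decoded form is usually the one to use when actually requesting the URL; the escaped form only matters when the value is written back into HTML.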
Regarding "python - BS HTML parsing - & is ignored when printing a URL string", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45906401/