Python 3 Beautiful Soup: finding tags that contain a colon

I am trying to scrape a site and extract two separate tags. This is what the markup looks like:

<url>
  <loc>
    http://link.com
  </loc>
  <lastmod>date</lastmode>
  <changefreq>daily</changefreq>
  <image:image>
   <image:loc>
    https://imagelink.com
   <image:loc>
   <image:title>Item title</image:title>
  <image:image>
</url>

The tags I want are loc and image:title. The problem I'm running into is the colon in the title tag's name. The code I have so far is:

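import requests
from bs4 import BeautifulSoup

# `url` is the sitemap address (not shown in the question)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

for item in soup.find_all('url'):
    print(item.loc)
    # print image title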

I also tried:

print(item.title)

but that doesn't work.

Best answer

You should parse it in "xml" mode instead (this also requires lxml to be installed):

from bs4 import BeautifulSoup

data = """
<url>
  <loc>
    http://link.com
  </loc>
  <lastmod>date</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
   <image:loc>
    https://imagelink.com
   </image:loc>
   <image:title>Item title</image:title>
  </image:image>
</url>"""

soup = BeautifulSoup(data, 'xml')

for item in soup.find_all('url'):
    print(item.title.get_text())

This prints the item title.
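To see why the 'html.parser' attempt in the question missed the tag, it can help to look at the tag names an HTML parser reports. Below is a minimal sketch using the standard library's HTMLParser class, which Beautiful Soup's 'html.parser' mode is built on. TagNameCollector is a hypothetical helper written only for this illustration, and the markup is a shortened copy of the sample.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name through verbatim, colon included.
        self.names.append(tag)


markup = (
    "<url><loc>http://link.com</loc>"
    "<image:image><image:title>Item title</image:title></image:image></url>"
)

collector = TagNameCollector()
collector.feed(markup)
collector.close()

print(collector.names)
# ['url', 'loc', 'image:image', 'image:title']
```

An HTML parser has no notion of namespaces, so the tag is named 'image:title' as a whole, and a lookup for a tag named 'title' has nothing to match. In that mode I would expect a lookup by the full name, such as item.find('image:title'), to be the one that matches.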

Note that I've applied several fixes to your XML string, since it was originally not well-formed.
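If the real document is a sitemap that declares its namespaces on the root element, the standard library can also resolve the prefixed names directly. The following is only a sketch of that alternative, not part of the approach above. It assumes the file wraps the entries in a urlset element and uses the usual sitemap and image-sitemap namespace URIs, neither of which is shown in the question. The "sm" prefix and the extract_entries function are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace declarations; the question's snippet does not show them.
SITEMAP_NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

data = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      http://link.com
    </loc>
    <image:image>
      <image:loc>https://imagelink.com</image:loc>
      <image:title>Item title</image:title>
    </image:image>
  </url>
</urlset>"""


def extract_entries(xml_text):
    """Return a (loc, image title) pair for every <url> entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        # The prefixes are resolved through SITEMAP_NS, so the colon is
        # treated as a namespace separator rather than part of the name.
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        title = url.findtext(
            "image:image/image:title", default="", namespaces=SITEMAP_NS
        )
        entries.append((loc.strip(), title.strip()))
    return entries


print(extract_entries(data))
# [('http://link.com', 'Item title')]
```

Unlike the lxml-based parse, ElementTree rejects markup that is not well-formed or that uses an undeclared prefix, so this only applies when the namespaces really are declared.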

This question about finding tags with a colon in Python 3 Beautiful Soup comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/39934304/
