Extracting HTML metadata with Python Beautiful Soup

Tags: python html twitter web-scraping beautifulsoup

I've run into some odd behavior that I don't quite understand, and I'm hoping someone can explain what is going on.

Consider this metadata:

<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">

This line successfully finds all the "og" properties and returns a list.

opengraphs = doc.html.head.findAll(property=re.compile(r'^og'))

However, this line fails to do the same for the Twitter cards.

twitterCards = doc.html.head.findAll(name=re.compile(r'^twitter'))

Why does the first line find all the "og" (Open Graph) tags, while the second finds none of the Twitter cards?

Best answer

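The problem is that name= has a special meaning in findAll: it matches the tag name, which in your code is meta. So name=re.compile(r'^twitter') looks for tags whose name starts with "twitter", not for tags with a matching name attribute.

You have to pass "meta" as the tag name and give the name attribute in a dictionary.

Examples for the different cases: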

from bs4 import BeautifulSoup
import re

data='''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta property="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

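# Name the parser explicitly; omitting it makes bs4 emit a warning.
head = BeautifulSoup(data, 'html.parser')

# Keyword arguments other than the reserved ones filter on attributes.
print(head.findAll(property=re.compile(r'^og'))) # OK: the og:title tag
print(head.findAll(property=re.compile(r'^tw'))) # OK: the tag with property="twitter:title"

# name= filters on the tag name, not on the "name" attribute.
print(head.findAll(name=re.compile(r'^meta'))) # OK: all three <meta> tags
print(head.findAll(name=re.compile(r'^tw')))   # empty: no tag name starts with "tw"

# Pass the tag name first and the "name" attribute in a dictionary.
print(head.findAll('meta', {'name': re.compile(r'^tw')})) # OK: the tag with name="twitter:title"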
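
As a follow-up to the dictionary form, here is a minimal sketch of two other ways to reach the same tags: a CSS attribute selector via select(), and the attrs= keyword of find_all(). The helper collect_meta and the extra twitter:card sample tag are made up for illustration and are not part of the original question; the sketch assumes a bs4 version whose select() supports the [attr^="prefix"] selector.

```python
import re
from bs4 import BeautifulSoup

# Sample markup; the twitter:card line is illustrative data, not from the question.
html = '''
<html><head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:card" content="summary">
</head></html>
'''

soup = BeautifulSoup(html, 'html.parser')


def collect_meta(soup, attr, prefix):
    """Hypothetical helper: map each matching attribute value to its content.

    Selects <meta> tags whose `attr` attribute starts with `prefix`.
    """
    # [attr^="prefix"] is the CSS "starts with" attribute selector.
    selector = f'meta[{attr}^="{prefix}"]'
    return {tag[attr]: tag.get('content') for tag in soup.select(selector)}


opengraph = collect_meta(soup, 'property', 'og:')
twitter_cards = collect_meta(soup, 'name', 'twitter:')

# attrs= sidesteps the clash with the reserved name= argument.
same_tags = soup.find_all('meta', attrs={'name': re.compile(r'^twitter')})

print(opengraph)
print(twitter_cards)
print(len(same_tags))
```

Because the selector targets the attribute explicitly, there is no ambiguity between the tag name and the name attribute, which is the confusion behind the original question.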

For more on extracting HTML metadata with Python Beautiful Soup, see the similar question on Stack Overflow: https://stackoverflow.com/questions/47852373/
