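python-3.x - Python web scraping misses one element when searching a list of objects

I'm trying to scrape some data with the beautifulsoup and requests libraries in Python 3.7. Every item on this page (tagged article) has a YouTube link. After finding all instances of article, I can extract the headlines without trouble. The code also finds the youtube-player class instance inside each article, except at index 7, where the output is None.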

from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')

for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)

However, if I search the source (the BeautifulSoup output) for youtube-player directly, I get every instance correctly.

links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)

How can I improve my code so that the article loop picks up every youtube-player instance?

Best answer

You can use the built-in zip() function to pair each title with its YouTube link.

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))

Prints:

Git: Difference between “add -A”, “add -u”, “add .”, and “add *”           https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations                           https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing             https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them    https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive          https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server                         https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals                               https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal                        https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015                              https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops                                              https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
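
One caveat with zip(): it pairs items purely by position and stops at the shorter input, so it only lines up correctly when every article has exactly one player. The sketch below illustrates this with made-up data (the titles and example.com URLs are hypothetical, not taken from the page):

```python
from itertools import zip_longest

# Hypothetical data: the second article ("Post B") has no embedded player.
titles = ["Post A", "Post B", "Post C"]
players = ["https://example.com/embed/a", "https://example.com/embed/c"]

# zip() pairs by position and stops at the shorter list, so "Post B"
# is paired with the link that belongs to "Post C", and "Post C" is dropped.
zipped = list(zip(titles, players))

# zip_longest() keeps every title, but the pairing is still positional:
# the misalignment remains and the empty slot simply moves to the end.
padded = list(zip_longest(titles, players, fillvalue=""))

print(zipped)
print(padded)
```

If some articles might lack a video, looking up the player inside each article avoids this kind of silent misalignment.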

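Edit: It seems that with html.parser, BeautifulSoup fails to recognize the YouTube iframe in one place. Use lxml or html5lib instead (both are third-party packages that need to be installed separately):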

import requests
from bs4 import BeautifulSoup

url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")

for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src':''}
    print('{:<75}{}'.format(title.text, player['src']))
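
To try the per-article lookup without hitting the network, here is a minimal sketch that runs the same idea against a small inline document. The markup in SAMPLE_HTML and the helper name titles_and_players are assumptions made for illustration; the sample is well-formed, so it does not reproduce the html.parser problem on the real page. It only shows how the fallback behaves when an article has no player.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed markup: the second article has no player.
SAMPLE_HTML = """
<article><h2 class="entry-title"><a href="#">Post A</a></h2>
<iframe class="youtube-player" src="https://example.com/embed/a"></iframe></article>
<article><h2 class="entry-title"><a href="#">Post B</a></h2>
<p>No video here.</p></article>
"""


def titles_and_players(html, parser="html.parser"):
    """Return (title, src) pairs, one per <article>; src is '' if no player."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for article in soup.select("article"):
        title = article.select_one(".entry-title")
        # select_one() returns None when nothing matches inside this article.
        player = article.select_one("iframe.youtube-player")
        src = player.get("src", "") if player is not None else ""
        rows.append((title.get_text(strip=True), src))
    return rows


rows = titles_and_players(SAMPLE_HTML)
for title, src in rows:
    print('{:<10}{}'.format(title, src))
```

Passing parser="lxml" or parser="html5lib" switches the parser, provided that package is installed.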

Source: Stack Overflow, https://stackoverflow.com/questions/61914180/
