python-3.x - Python web scraping misses one element in a list of searched objects

Tags: python-3.x web-scraping beautifulsoup python-requests

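I am trying to scrape some data using the beautifulsoup and requests libraries in Python 3.7. Each item on this web page (marked up as an article tag) has a YouTube link. After finding all instances of article, I can extract the headlines successfully. The code also finds the youtube-player class instance inside each article, except at index 7, where the output is None.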

from bs4 import BeautifulSoup
import requests
url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
articles = soup.find_all('article')

for article in articles:
    headline = article.h2.a.text
    print(headline)
    link = article.find('iframe', {'class': 'youtube-player'})
    print(link)

However, looking at the source (the output of beautifulsoup), if I search for youtube-player directly, I get all instances correctly.

links = soup.find_all('iframe', {'class': 'youtube-player'})
for link in links:
    print(link)

How can I improve my code to get all the youtube-player instances inside the article loop?

Best answer

You can use the built-in zip() function to bind the titles and YouTube links together.

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for title, player in zip(soup.select('.entry-title'),
                         soup.select('iframe.youtube-player')):
    print('{:<75}{}'.format(title.text, player['src']))

Prints:

Git: Difference between “add -A”, “add -u”, “add .”, and “add *”           https://www.youtube.com/embed/tcd4txbTtAY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Programming Terms: Combinations and Permutations                           https://www.youtube.com/embed/QI9EczPQzPQ?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Chrome Quick Tip: Quickly Bookmark Open Tabs for Later Viewing             https://www.youtube.com/embed/tsiSg_beudo?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Comprehensions – How they work and why you should be using them    https://www.youtube.com/embed/3dt4OGnU5sM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Generators – How to use them and the benefits you receive          https://www.youtube.com/embed/bD05uGo_sVI?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Quickest and Easiest Way to Run a Local Web-Server                         https://www.youtube.com/embed/lE6Y6M9xPLw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Git for Beginners: Command-Line Fundamentals                               https://www.youtube.com/embed/HVsySz-h9r4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Time-Saving Keyboard Shortcuts for the Mac Terminal                        https://www.youtube.com/embed/TXzrk3b9sKM?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Overview of Online Learning Resources in 2015                              https://www.youtube.com/embed/QGy6M8HZSC4?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
Python: Else Clauses on Loops                                              https://www.youtube.com/embed/Dh-0lAyc3Bc?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
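
One caveat with zip() is worth keeping in mind: it pairs items purely by position and stops at the shorter input. The following is a minimal sketch with made-up data (the titles, URLs, and the player_by_title dict are hypothetical and not taken from the page) showing how a single missing player would shift every later link onto the wrong title, and how a per-article lookup avoids that.

```python
# Hypothetical data: three articles, but the second one has no embedded player.
titles = ["Post A", "Post B", "Post C"]
players = [
    "https://www.youtube.com/embed/aaa",  # belongs to "Post A"
    "https://www.youtube.com/embed/ccc",  # belongs to "Post C"
]

# zip() pairs by position and stops at the shorter list, so "Post B"
# silently receives the link that belongs to "Post C", and "Post C" is dropped.
paired_by_position = list(zip(titles, players))

# Looking each link up per article keeps the pairing correct.
# This dict stands in for a per-article search such as article.select_one(...).
player_by_title = {"Post A": players[0], "Post C": players[1]}
paired_by_article = [(title, player_by_title.get(title, "")) for title in titles]

for title, src in paired_by_article:
    print('{:<10}{}'.format(title, src))
```

On the page in the question the two lists happen to have the same length, so zip() lines up correctly there; the sketch only illustrates why that is not guaranteed in general.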

Edit: It seems that when you use html.parser, BeautifulSoup fails to recognize the YouTube link in one place. Use lxml or html5lib instead:

import requests
from bs4 import BeautifulSoup

url = 'https://coreyms.com/page/12'
soup = BeautifulSoup(requests.get(url).text, "lxml")

for article in soup.select('article'):
    title = article.select_one('.entry-title')
    player = article.select_one('iframe.youtube-player') or {'src':''}
    print('{:<75}{}'.format(title.text, player['src']))
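
To check whether the parser is really what makes the difference, one option is to count the players found anywhere in the document and the players found inside an article, once per parser. This is only a diagnostic sketch: count_players, compare_parsers, and SAMPLE_HTML are hypothetical names, the sample markup is made up and well-formed (so the counts should agree), and lxml and html5lib are separate packages that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up, well-formed markup standing in for the real page.
SAMPLE_HTML = """
<article>
  <h2 class="entry-title"><a href="#">First</a></h2>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/aaa"></iframe>
</article>
<article>
  <h2 class="entry-title"><a href="#">Second</a></h2>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/bbb"></iframe>
</article>
"""


def count_players(html, parser):
    """Return (players found anywhere, players found inside an <article>)."""
    soup = BeautifulSoup(html, parser)
    everywhere = len(soup.select('iframe.youtube-player'))
    nested = len(soup.select('article iframe.youtube-player'))
    return everywhere, nested


def compare_parsers(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each parser name to its counts, or None if it is not installed."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = count_players(html, parser)
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is unavailable.
            results[parser] = None
    return results


results = compare_parsers(SAMPLE_HTML)
for parser, counts in results.items():
    print(parser, counts)
```

To run it against the real page, pass requests.get(url).text instead of SAMPLE_HTML. If the two numbers differ for one parser, that parser built a tree in which some player is not nested inside its article, which matches the symptom described in the question.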

Regarding "python-3.x - Python web scraping misses one element in a list of searched objects", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61914180/
