I'm trying to save each article's content in its own text file. The problem I'm running into is coming up with a way to have Beautiful Soup return only articles of the News type while ignoring the other article types.
Relevant website: https://www.nature.com/nature/articles
Information
- Each article is contained within a pair of <article> tags.
- Each article's type is tucked inside a <span> tag with a data-test attribute whose value is article.type.
- The article title sits inside an <a> tag with the attribute data-track-label="link".
- The article body is wrapped in a <div> tag (look for "body" in the class attribute).
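The pointers above map directly onto BeautifulSoup lookups. A minimal sketch, run against an inline fragment modelled on the structure just described (the tag values are hypothetical, not taken from the live page):

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on the structure described above (hypothetical values).
html = '''
<article>
  <h3><a data-track-label="link" href="/articles/x">Example title</a></h3>
  <div class="c-card__body"><p>Example body text.</p></div>
  <span data-test="article.type"><span class="c-meta__type">News</span></span>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article')

# Type: the <span> whose data-test attribute is "article.type"
art_type = article.find('span', attrs={'data-test': 'article.type'}).get_text(strip=True)
# Title: the <a> with data-track-label="link"
title = article.find('a', attrs={'data-track-label': 'link'}).get_text(strip=True)
# Body: a <div> whose class attribute contains "body"
body = article.find('div', class_=lambda c: c and 'body' in c).get_text(strip=True)

print(art_type, title, body)
```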
Current code
I'm able to query the <span> tags for the News article type, but I'm struggling with the next steps needed to return the rest of each article's information. How can I take this further? For articles of the News type, I'd also like to return that article's title and body, while ignoring all articles that aren't of the News type.
# Send HTTP requests
import requests
from bs4 import BeautifulSoup

class WebScraper:
    @staticmethod
    def get_the_source():
        # Obtain the URL
        url = 'https://www.nature.com/nature/articles'
        # Get the webpage
        r = requests.get(url)
        # Check response object's status code
        if r:
            the_source = open("source.html", "wb")
            soup = BeautifulSoup(r.content, 'html.parser')
            type_news = soup.find_all("span", string='News')
            for i in type_news:
                print(i.text)
            the_source.write(r.content)
            the_source.close()
            print('\nContent saved.')
        else:
            print(f'The URL returned {r.status_code}!')

WebScraper.get_the_source()
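One way to bridge from the existing find_all("span", string='News') matches to the rest of the article is find_parent(): climb from the matched label up to the enclosing <article>, then drill back down for the title and body. A sketch against an inline fragment with hypothetical values, not the live page:

```python
from bs4 import BeautifulSoup

# Two articles, only one of type News (hypothetical inline fragment).
html = '''
<article>
  <a data-track-label="link">News title</a>
  <div class="c-card__body">News body</div>
  <span class="c-meta__type">News</span>
</article>
<article>
  <a data-track-label="link">Other title</a>
  <span class="c-meta__type">Research</span>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')
titles = []
for span in soup.find_all('span', string='News'):  # same query as in the code above
    art = span.find_parent('article')              # climb to the enclosing article
    title = art.find('a', attrs={'data-track-label': 'link'}).get_text(strip=True)
    titles.append(title)

print(titles)
```

Only the first article's label is the exact string 'News', so only its title is collected.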
Example HTML for a News-type article
The source contains 19 other articles with similar and different article types.
<article class="u-full-height c-card c-card--flush" itemscope itemtype="http://schema.org/ScholarlyArticle">
<div class="c-card__image">
<picture>
<source
type="image/webp"
srcset="
//media.springernature.com/w165h90/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 160w,
//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 290w"
sizes="
(max-width: 640px) 160px,
(max-width: 1200px) 290px,
290px">
<img src="//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg"
alt=""
itemprop="image">
</picture>
</div>
<div class="c-card__body u-display-flex u-flex-direction-column">
<h3 class="c-card__title" itemprop="name headline">
<a href="/articles/d41586-021-00485-2"
class="c-card__link u-link-inherit"
itemprop="url"
data-track="click"
data-track-action="view article"
data-track-label="link">Mars arrivals and Etna eruption — February's best science images</a>
</h3>
<div class="c-card__summary u-mb-16 u-hide-sm-max"
itemprop="description">
<p>The month’s sharpest science shots, selected by <i>Nature's</i> photo team.</p>
</div>
<div class="u-mt-auto">
<ul data-test="author-list" class="c-author-list c-author-list--compact u-mb-4">
<li itemprop="creator" itemscope="" itemtype="http://schema.org/Person"><span itemprop="name">Emma Stoye</span></li>
</ul>
<div class="c-card__section c-meta">
<span class="c-meta__item c-meta__item--block-at-xl" data-test="article.type">
<span class="c-meta__type">News</span>
</span>
<time class="c-meta__item c-meta__item--block-at-xl" datetime="2021-03-05" itemprop="datePublished">05 Mar 2021</time>
</div>
</div>
</div>
</article>
</div>
</li>
<li class="app-article-list-row__item">
<div class="u-full-height" data-native-ad-placement="false">
Best answer
The simplest way is to add news as a parameter in the query string; each request then returns only news results:
https://www.nature.com/nature/articles?type=news
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles?type=news')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item')
for n in news_articles:
    print(n.select_one('.c-card__link').text)
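To get back to the question's original goal of one text file per article, the loop can be extended with the summary text (the listing page exposes c-card__summary, as in the sample HTML above; the full article body would need a second request to each article URL). A sketch, factored as a function so it can be fed either the live page's content or a saved copy:

```python
import re
from bs4 import BeautifulSoup

def save_articles(html, out_dir='.'):
    """Save each listing item's title and summary to its own .txt file.

    Assumes the markup shown in the question (c-card__link, c-card__summary);
    returns the paths written.
    """
    soup = BeautifulSoup(html, 'html.parser')
    paths = []
    for n in soup.select('.app-article-list-row__item'):
        link = n.select_one('.c-card__link')
        if link is None:
            continue
        title = link.get_text(strip=True)
        summary = n.select_one('.c-card__summary')
        body = summary.get_text(strip=True) if summary else ''
        # Derive a filesystem-safe name from the title
        fname = re.sub(r'[^\w-]+', '_', title)[:80] + '.txt'
        path = f'{out_dir}/{fname}'
        with open(path, 'w', encoding='utf-8') as f:
            f.write(title + '\n\n' + body)
        paths.append(path)
    return paths
```

Usage would be e.g. save_articles(requests.get('https://www.nature.com/nature/articles?type=news').content).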
Query-string parameters for page 2 of the news results:
https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2
If you monitor the browser's network tab while filtering manually on the page, or while selecting a different page number, you can see the logic by which the query string is constructed and tailor your requests accordingly, e.g.
https://www.nature.com/nature/articles?type=news&year=2021
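Rather than hand-assembling those URLs, requests can build the query string from a params dict; a small sketch using the parameters observed above:

```python
import requests

# Let requests encode the query string; the keys mirror the parameters
# seen in the URLs above.
params = {'searchType': 'journalSearch', 'sort': 'PubDate', 'type': 'news', 'page': 2}
req = requests.Request('GET', 'https://www.nature.com/nature/articles', params=params).prepare()
print(req.url)
```

In practice you would call requests.get('https://www.nature.com/nature/articles', params=params) directly; prepare() is used here only to show the URL without making a request.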
Otherwise, you can do more complicated (in/ex)clusion with css selectors, based on whether an article node has a particular child containing "News" (inclusion), while excluding nodes where "News" appears together with another word/symbol (according to the category list):
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item:has(.c-meta__type:contains("News"):not( \
                             :contains("&"), \
                             :contains("in"), \
                             :contains("Career"), \
                             :contains("Feature")))')  # exclusion list
for n in news_articles:
    print(n.select_one('.c-card__link').text)
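Note that :contains() is a non-standard soupsieve extension (newer soupsieve versions deprecate it in favour of :-soup-contains()). If the selector version feels brittle, an equivalent filter in plain Python compares the type label exactly, which excludes "News & Views", "Career News" etc. without any :not() list. A sketch, factored as a function:

```python
from bs4 import BeautifulSoup

def news_titles(html):
    """Titles of listing items whose type label is exactly 'News'.

    The exact == comparison replaces the :not() exclusion list:
    'News & Views', 'Career News' etc. simply fail the test.
    """
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for n in soup.select('.app-article-list-row__item'):
        label = n.select_one('.c-meta__type')
        if label and label.get_text(strip=True) == 'News':
            titles.append(n.select_one('.c-card__link').get_text(strip=True))
    return titles
```

Usage would be e.g. news_titles(requests.get('https://www.nature.com/nature/articles').content).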
If you do want e.g. News & Views as well, you can remove that category from the :not() list.
Original question on Stack Overflow: https://stackoverflow.com/questions/66523253/