python - Web scraping with Python requests and Beautiful Soup

Tags: python beautifulsoup request

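I am trying to scrape all the articles on this web page: https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/

I can scrape the first article, but I need help understanding how to move on to the next article and scrape its information as well. Thanks in advance for your support.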

import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self,url,title,body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

# Scraping news articles from Coindesk

def scrapeCoindesk(url):
    bs = getPage(url)
    title = bs.find("h3").text
    body = bs.find("p",{'class':'desc'}).text
    return Content(url,title,body)

# Pulling the article from Coindesk (find() only returns the first match on the page)

url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

Best answer

You can take advantage of the fact that each article is wrapped in a div.article element and iterate over those elements:

def scrapeCoindesk(url):
    bs = getPage(url)
    articles = []
    for article in bs.find_all("div", {"class": "article"}):
        title = article.find("h3").text
        body = article.find("p", {"class": "desc"}).text
        article_url = article.find("a", {"class": "fade"})["href"]
        articles.append(Content(article_url, title, body))
    return articles


# Pulling every article listed on the Coindesk category page
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
for article in content:
    print(article.url)
    print(article.title)
    print(article.body)
    print("-------------")
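
One caveat with the loop above: article.find(...) returns None when an element is missing, so calling .text on it raises AttributeError for any block that lacks a heading, summary, or link. The following is a minimal sketch of a more defensive variant, not the answer's own code. It runs against a hypothetical inline HTML sample (SAMPLE_HTML) that only mimics the div.article structure described above, so the real page markup may differ. The helper name parse_articles and the use of urljoin to resolve relative links are assumptions added for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the answer relies on
# (div.article containing h3, p.desc and a.fade); the real page may differ.
SAMPLE_HTML = """
<div class="article">
  <a class="fade" href="/first-story/"></a>
  <h3>First story</h3>
  <p class="desc">Summary of the first story.</p>
</div>
<div class="article">
  <a class="fade" href="https://example.com/second-story/"></a>
  <h3>Second story</h3>
</div>
<div class="article">
  <p class="desc">Teaser without a headline.</p>
</div>
"""


def parse_articles(html, base_url):
    """Return one dict per div.article block that has both a title and a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", {"class": "article"}):
        heading = block.find("h3")
        link = block.find("a", {"class": "fade"})
        summary = block.find("p", {"class": "desc"})
        # Skip incomplete blocks instead of failing on None.
        if heading is None or link is None or not link.get("href"):
            continue
        results.append({
            # urljoin leaves absolute URLs alone and resolves relative ones.
            "url": urljoin(base_url, link["href"]),
            "title": heading.get_text(strip=True),
            "body": summary.get_text(strip=True) if summary else "",
        })
    return results


articles = parse_articles(SAMPLE_HTML, "https://www.coindesk.com/")
for item in articles:
    print(item["url"], "|", item["title"], "|", item["body"])
```

To use this against the live page, the same function could be given req.text from getPage's request in place of SAMPLE_HTML, assuming the page still uses these class names.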

Source: a similar question about web scraping with Python requests and Beautiful Soup on Stack Overflow: https://stackoverflow.com/questions/50254397/
