python - Frequent HTTP Error 413 when scraping multiple pages

Tags: python pandas web-scraping beautifulsoup runtime-error

I am scraping posts from Wykop.pl ("the Polish Reddit") by looping over the pages returned when I search the site for a keyword I am interested in. I wrote a loop that iterates over the target content on each page; however, the loop terminates on certain pages (consistently the same ones) with the error "HTTP Error 413: Request Entity Too Large".

I have tried scraping the problematic pages individually, but the same error message keeps coming back. To work around it I have to set the page ranges manually and collect the data in chunks, at the cost of missing a lot of data, and I would like to know whether there is a Pythonic way to handle this error. I also tried longer pauses, in case I was risking sending too many requests, but that does not seem to be the issue.

from time import sleep
from time import time
from random import randint
from warnings import warn
import requests
from requests import get
from bs4 import BeautifulSoup
from mtranslate import translate
from IPython.core.display import clear_output

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
requests = 0
pages = [str(i) for i in range(1,10)]

for page in pages:
    url = "https://www.wykop.pl/szukaj/wpisy/smog/strona/" + page + "/"
    response = get(url)

    # Pause the loop
    sleep(randint(8,15))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    # Break the loop if the number of requests is greater than expected
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break


    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")


    for result in results:
        # Error handling
        try:
            post = result.find('div', class_="text").text
            post = translate(post,'en','auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img',class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)

If I could run the script in one go, I would set the range from 1 to 163 (since there are 163 pages of search results mentioning the keyword I am interested in). As it stands, I have to set smaller ranges and collect the data piecemeal, again at the cost of losing pages of data.

As a stopgap, my other option is to scrape the problematic pages from HTML documents downloaded to my desktop.
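For reference, that fallback would look roughly like this (the filename smog_page_5.html is just a placeholder for whatever I save the page as):

from bs4 import BeautifulSoup

# Parse a page that was saved manually from the browser instead of fetched live.
# "smog_page_5.html" is a placeholder name for the downloaded file.
with open("smog_page_5.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

results = soup.find_all("li", class_="entry iC")
print(len(results), "entries found in the saved page")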

Best Answer

You may be running into some kind of IP address restriction. When I ran the script it worked fine for me, without any rate limiting (at the moment). Still, I would recommend using requests.Session() (you will need to rename your requests variable, because otherwise it overwrites the import). This can also help reduce potential memory leak issues.

For example:

from bs4 import BeautifulSoup
from time import sleep
from time import time
from random import randint
import requests

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
request_count = 0
req_sess = requests.Session()

for page_num in range(1, 100):
    response = req_sess.get(f"https://www.wykop.pl/szukaj/wpisy/smog/strona/{page_num}/")

    # Pause the loop
    #sleep(randint(1,3))

    # Monitor the requests
    request_count += 1
    elapsed_time = time() - start_time
    print('Page {}; Request:{}; Frequency: {} requests/s'.format(page_num, request_count, request_count/elapsed_time))
    
    #clear_output(wait = True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        print('Request: {}; Status code: {}'.format(request_count, response.status_code))
        print(response.headers)
    
    # Break the loop if the number of requests is greater than expected
    #if request_count > 10:
    #    print('Number of requests was greater than expected.')
    #    break

    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")

    for result in results:
        # Error handling
        try:
            post = result.find('div', class_="text").text
            #post = translate(post,'en','auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img',class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)
            

This gives the following output:

Page 1; Request:1; Frequency: 1.246137372973911 requests/s
Page 2; Request:2; Frequency: 1.3021880233774552 requests/s
Page 3; Request:3; Frequency: 1.2663757427416629 requests/s
Page 4; Request:4; Frequency: 1.1807827876080845 requests/s                
.
.
.
Page 96; Request:96; Frequency: 0.8888853607003809 requests/s
Page 97; Request:97; Frequency: 0.8891876183362001 requests/s
Page 98; Request:98; Frequency: 0.888801819672809 requests/s
Page 99; Request:99; Frequency: 0.8900784741536467 requests/s                

This also worked fine when I started from a higher page number. In theory, when you do receive a 413 status code, the response headers should now be printed. According to RFC 7231, the server should return a Retry-After header field, which you can use to decide how long to back off before the next request.
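A rough sketch of how that header could be used (the helper name get_with_backoff is just illustrative, and it assumes Retry-After comes back as a number of seconds rather than an HTTP date):

import time
import requests

req_sess = requests.Session()

def get_with_backoff(url, max_retries=3):
    # Retry on 413, waiting for the Retry-After interval if the server provides one.
    for attempt in range(max_retries):
        response = req_sess.get(url)
        if response.status_code != 413:
            return response
        # Fall back to a fixed delay if the header is missing or not a plain number.
        try:
            delay = int(response.headers.get("Retry-After", 30))
        except ValueError:
            delay = 30
        print('413 on attempt {}; sleeping {}s before retrying'.format(attempt + 1, delay))
        time.sleep(delay)
    return response

# e.g. response = get_with_backoff("https://www.wykop.pl/szukaj/wpisy/smog/strona/5/")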

Regarding "python - Frequent HTTP Error 413 when scraping multiple pages", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55582979/
