python - How to prevent asyncio.TimeoutError from being raised and continue the loop

Tags: python exception python-asyncio aiohttp timeoutexception

I am using aiohttp with the limited_as_completed method to speed up scraping (about 100 million static web pages). However, the code stops after a few minutes and returns a TimeoutError. I have tried several approaches, but I still cannot prevent asyncio.TimeoutError from being raised. How can I ignore the error and continue?

The code I am running is:

N=123
import html
from lxml import etree
import requests
import asyncio 
import aiohttp
from aiohttp import ClientSession, TCPConnector
import pandas as pd
import re 
import csv 
import time
from itertools import islice
import sys
from contextlib import suppress

start = time.time()
data = {}
data['name'] = []
filename = "C:\\Users\\xxxx"+ str(N) + ".csv"

def limited_as_completed(coros, limit):
    futures = [
        asyncio.ensure_future(c)
        for c in islice(coros, 0, limit)
    ]
    async def first_to_finish():
        while True:
            await asyncio.sleep(0)
            for f in futures:
                if f.done():
                    futures.remove(f)
                    try:
                        newf = next(coros)
                        futures.append(
                            asyncio.ensure_future(newf))
                    except StopIteration as e:
                        pass
                    return f.result()
    while len(futures) > 0:
        yield first_to_finish()

async def get_info_byid(i, url, session):
    async with session.get(url,timeout=20) as resp:
        print(url)
        with suppress(asyncio.TimeoutError):
            r = await resp.text()
            name = etree.HTML(r).xpath('//h2[starts-with(text(),"Customer Name")]/text()')
            data['name'].append(name)
            dataframe = pd.DataFrame(data)
            dataframe.to_csv(filename, index=False, sep='|')

limit = 1000
async def print_when_done(tasks):
    for res in limited_as_completed(tasks, limit):
        await res

url = "http://xxx.{}.html"
loop = asyncio.get_event_loop()

async def main():
    connector = TCPConnector(limit=10)
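    # NOTE: "headers" used below is defined elsewhere in the original script and is not shown in this excerpt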
    async with ClientSession(connector=connector,headers=headers,raise_for_status=False) as session:
        coros = (get_info_byid(i, url.format(i), session) for i in range(N,N+1000000))
        await print_when_done(coros)

loop.run_until_complete(main())
loop.close()
print("took", time.time() - start, "seconds.")

The error log is:

Traceback (most recent call last):
  File "C:\Users\xxx.py", line 111, in <module>
    loop.run_until_complete(main())
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 573, in run_until_complete
    return future.result()
  File "C:\Users\xxx.py", line 109, in main
    await print_when_done(coros)
  File "C:\Users\xxx.py", line 98, in print_when_done
    await res
  File "C:\Users\xxx.py", line 60, in first_to_finish
    return f.result()
  File "C:\Users\xxx.py", line 65, in get_info_byid
    async with session.get(url,timeout=20) as resp:
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client.py", line 855, in __aenter__
    self._resp = await self._coro
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client.py", line 391, in _request
    await resp.start(conn)
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\client_reqrep.py", line 770, in start
    self._continue = None
  File "C:\Users\xx\AppData\Local\Programs\Python\Python37-32\lib\site-packages\aiohttp\helpers.py", line 673, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError

I have already tried: 1) adding except asyncio.TimeoutError: pass. It does not work:

async def get_info_byid(i, url, session):
    async with session.get(url,timeout=20) as resp:
        print(url)
        try:
            r = await resp.text()
            name = etree.HTML(r).xpath('//h2[starts-with(text(),"Customer Name")]/text()')
            data['name'].append(name)
            dataframe = pd.DataFrame(data)
            dataframe.to_csv(filename, index=False, sep='|')
        except asyncio.TimeoutError:
            pass

2) suppress(asyncio.TimeoutError), as shown above. It does not work either.

I only started learning aiohttp yesterday, so maybe there is some other problem in my code that makes the timeout error appear only after a few minutes of running? If anyone knows how to handle this, I would really appreciate it!
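
As a side note, the traceback above shows the TimeoutError being raised inside aiohttp's client.py while async with session.get(...) is being entered, which is before the suppressed (or try-wrapped) body ever runs. A minimal sketch of get_info_byid with the handler moved around the whole async with block (the parsing and CSV code from the original is omitted here) would be:

async def get_info_byid(i, url, session):
    try:
        # Wrap the request itself: the timeout fires while the connection
        # is being opened, not inside the body of the "async with" block.
        async with session.get(url, timeout=20) as resp:
            r = await resp.text()
            # ... parse r and append to data here, as in the original ...
    except asyncio.TimeoutError:
        # Skip this URL; limited_as_completed simply moves on to the next one.
        print("timed out:", url)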

Best Answer

What @Yurii Kramarenko has done will certainly raise an "Unclosed client session" exception, since the session is never properly closed. What I would recommend is something like this:

import asyncio
import aiohttp

async def main(urls):
    async with aiohttp.ClientSession(timeout=self.timeout) as session:
        tasks=[self.do_something(session,url) for url in urls]
        await asyncio.gather(*tasks)
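
The snippet above still refers to self.timeout and self.do_something from the answerer's own class. A self-contained sketch of the same pattern, with a hypothetical fetch worker and an assumed 20-second total timeout, in which each worker swallows its own asyncio.TimeoutError so that asyncio.gather keeps processing the remaining URLs:

import asyncio
import aiohttp

TIMEOUT = aiohttp.ClientTimeout(total=20)  # assumed value, adjust as needed

async def fetch(session, url):
    # Hypothetical worker: a timeout on one URL is caught here,
    # so asyncio.gather() continues with the remaining URLs.
    try:
        async with session.get(url) as resp:
            return await resp.text()
    except asyncio.TimeoutError:
        print("timed out:", url)
        return None

async def main(urls):
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Example usage:
# asyncio.get_event_loop().run_until_complete(
#     main(["http://example.com/page{}.html".format(i) for i in range(10)]))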

Regarding "python - How to prevent asyncio.TimeoutError from being raised and continue the loop", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53049523/
