python - Can't use https proxies along with reusing the same session within a script built upon asyncio

Tags: python python-3.x web-scraping python-asyncio aiohttp

I'm trying to use https proxies within async requests, making use of the asyncio library. When it comes to http proxies there is clear instruction here, but I get stuck when it comes to https proxies. Moreover, I would like to reuse the same session instead of creating a new one every time I send a request.

This is what I have tried so far (the proxies used within the script were taken directly from a free proxy site, so consider them placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies, proxy_url
    while True:
        check_url = proxy_url
        proxy = proxy_url  # the list entries already carry the http:// scheme
        print("trying using:", check_url)
        # a new session is created on every request, which is exactly
        # what I would like to avoid
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    return await resp.text()
            except Exception:
                # swap in a fresh proxy only if another coroutine
                # hasn't already replaced the failing one
                if check_url == proxy_url:
                    proxy_url = proxies.pop()

async def field_info(field_link):              
    text = await get_text(field_link)          
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()

How can I use https proxies in the script along with reusing the same session?

Best Answer

This script creates a dictionary, proxy_session_map, where the keys are proxies and the values are sessions. That way we know which session belongs to which proxy.

If some error occurs while using a proxy, I add that proxy to the disabled_proxies set so that I won't use it again:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()

proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout = aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''


async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())

Prints (for example):

trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?

... and so on.
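
As a side note on the session-reuse half of the question: aiohttp accepts the proxy argument per request, so a single ClientSession can be shared across all requests while the proxy rotates underneath it. Below is a minimal sketch of that pattern (not the answerer's code), reusing the question's placeholder proxies; fetch and PROXIES are illustrative names:

import asyncio
from itertools import cycle

import aiohttp

# placeholder proxies, as in the question
PROXIES = cycle([
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
])

async def fetch(session, url):
    # the proxy is chosen per request, so one session is enough
    proxy = next(PROXIES)
    async with session.get(url, proxy=proxy, ssl=False) as resp:
        return await resp.text()

async def main():
    # one session for the whole run, closed by the context manager
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5)) as session:
        urls = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2, 5)]
        results = await asyncio.gather(*(fetch(session, u) for u in urls), return_exceptions=True)
        for r in results:
            print(len(r) if isinstance(r, str) else r)

if __name__ == '__main__':
    asyncio.run(main())

This keeps one connection pool alive for the whole run; the error handling and proxy disabling from the accepted answer can be layered on top. Note that both snippets still use plain http:// proxy URLs: aiohttp's support for connecting to a proxy itself over TLS (an https:// proxy URL) has historically been limited, which is why the accepted answer sticks to http:// proxies.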

A similar question about python - Can't use https proxies along with reusing the same session within a script built upon asyncio can be found on Stack Overflow: https://stackoverflow.com/questions/62356159/
