python - Can't use https proxies along with reusing the same session within a script built upon asyncio

Tags: python python-3.x web-scraping python-asyncio aiohttp

I'm trying to use https proxies within async requests, making use of the asyncio library. When it comes to http proxies there is clear instruction here, but I get stuck when it comes to https proxies. Moreover, I would like to reuse the same session instead of creating a new one every time I send a request.

This is what I have tried so far (the proxies used within the script were taken directly from a free proxy site, so consider them placeholders):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

async def get_text(url):
    global proxies, proxy_url
    while True:
        check_url = proxy_url
        proxy = proxy_url  # the list entries already carry the http:// scheme
        print("trying using:", check_url)
        # a new session is created on every request, which is exactly
        # what I would like to avoid
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    return await resp.text()
            except Exception:
                # swap in a fresh proxy only if another coroutine
                # hasn't already replaced the failing one
                if check_url == proxy_url:
                    proxy_url = proxies.pop()

async def field_info(field_link):              
    text = await get_text(field_link)          
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

if __name__ == '__main__':
    proxy_url = proxies.pop()
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(field_info(url) for url in links)))
    loop.run_until_complete(future)
    loop.close()

How can I use https proxies in the script along with reusing the same session?

Best Answer

This script creates a dictionary, proxy_session_map, where the keys are proxies and the values are sessions. That way we know which session belongs to which proxy.

If some error occurs while using a proxy, I add that proxy to the disabled_proxies set so that I won't use it again:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

from random import choice

proxies = [
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
    'http://103.81.104.66:34717',
    'http://124.41.213.211:41828',
    'http://93.191.100.231:3128'
]

disabled_proxies = set()

proxy_session_map = {}

async def get_text(url):
    while True:
        try:
            available_proxies = [p for p in proxies if p not in disabled_proxies]

            if available_proxies:
                proxy = choice(available_proxies)
            else:
                proxy = None

            if proxy not in proxy_session_map:
                proxy_session_map[proxy] = aiohttp.ClientSession(timeout = aiohttp.ClientTimeout(total=5))

            print("trying using:",proxy)

            async with proxy_session_map[proxy].get(url,proxy=proxy,ssl=False) as resp:
                return await resp.text()

        except Exception as e:
            if proxy:
                print("error, disabling:",proxy)
                disabled_proxies.add(proxy)
            else:
                # we haven't used proxy, so return empty string
                return ''


async def field_info(field_link):
    text = await get_text(field_link)
    soup = BeautifulSoup(text,'lxml')
    for item in soup.select(".summary .question-hyperlink"):
        print(item.get_text(strip=True))

async def main():
    links = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2,5)]
    tasks = [field_info(url) for url in links]

    await asyncio.gather(
        *tasks
    )

    # close all sessions:
    for s in proxy_session_map.values():
        await s.close()

if __name__ == '__main__':
    asyncio.run(main())

Prints (for example):

trying using: http://89.22.210.191:41258
trying using: http://124.41.213.211:41828
trying using: http://124.41.213.211:41828
error, disabling: http://124.41.213.211:41828
trying using: http://93.191.100.231:3128
error, disabling: http://124.41.213.211:41828
trying using: http://103.81.104.66:34717
BeautifulSoup to get image name from P class picture tag in Python
Scrape instagram public information from google cloud functions [duplicate]
Webscraping using R - the full website data is not loading
Facebook Public Data Scraping
How it is encode in javascript?

... and so on.
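
As a side note on the session-reuse half of the question: aiohttp accepts the proxy argument per request, so a single ClientSession can be shared across all requests while the proxy rotates underneath it. Below is a minimal sketch of that pattern (not the answerer's code), reusing the question's placeholder proxies; fetch and PROXIES are illustrative names:

import asyncio
from itertools import cycle

import aiohttp

# placeholder proxies, as in the question
PROXIES = cycle([
    'http://89.22.210.191:41258',
    'http://91.187.75.48:39405',
])

async def fetch(session, url):
    # the proxy is chosen per request, so one session is enough
    proxy = next(PROXIES)
    async with session.get(url, proxy=proxy, ssl=False) as resp:
        return await resp.text()

async def main():
    # one session for the whole run, closed by the context manager
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=5)) as session:
        urls = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50".format(page) for page in range(2, 5)]
        results = await asyncio.gather(*(fetch(session, u) for u in urls), return_exceptions=True)
        for r in results:
            print(len(r) if isinstance(r, str) else r)

if __name__ == '__main__':
    asyncio.run(main())

This keeps one connection pool alive for the whole run; the error handling and proxy disabling from the accepted answer can be layered on top. Note that both snippets still use plain http:// proxy URLs: aiohttp's support for connecting to a proxy itself over TLS (an https:// proxy URL) has historically been limited, which is why the accepted answer sticks to http:// proxies.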

A similar question about python - Can't use https proxies along with reusing the same session within a script built upon asyncio can be found on Stack Overflow: https://stackoverflow.com/questions/62356159/
