python - "Connection pool is full, discarding connection" with ThreadPoolExecutor and multiple headless browsers through Selenium and Python

Tags: python selenium threadpool threadpoolexecutor urllib3

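I am writing some automation software using selenium==3.141.0, python 3.6.7 and chromedriver 2.44.

Most of the logic can be executed by a single browser instance, but for some parts I have to launch 10-20 instances to get a decent execution speed.

Once it reaches the part that is executed by ThreadPoolExecutor, browser interactions start throwing this error: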

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

Browser setup:

def init_chromedriver(cls):
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument(f"user-agent={Utils.get_random_browser_agent()}")
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_options.add_experimental_option("prefs", prefs)

        driver = webdriver.Chrome(driver_paths['chrome'],
                                       chrome_options=chrome_options,
                                       service_args=['--verbose', f'--log-path={bundle_dir}/selenium/chromedriver.log'])
        driver.implicitly_wait(10)

        return driver
    except Exception as e:
        logger.error(e)

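Relevant code:

ProfileParser instantiates a webdriver and performs some page interactions. I assume the interactions themselves are not relevant, because everything works without ThreadPoolExecutor. In short, however: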

class ProfileParser(object):
    def __init__(self, acc):
        self.driver = Utils.init_chromedriver()

    def __enter__(self):
        # Required so that `with ProfileParser(acc) as pparser:` binds the instance.
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        Utils.shutdown_chromedriver(self.driver)
        self.driver = None

    def collect_user_info(self, post_url):
        self.driver.get(post_url)
        profile_url = self.driver.find_element_by_xpath('xpath_here').get_attribute('href')

When this runs in ThreadPoolExecutor, the errors above appear at self.driver.find_element_by_xpath or at self.driver.get.

This works:

with ProfileParser(acc) as pparser:
        pparser.collect_user_info(posts[0])

These options do not work (connectionpool errors):

futures = []
#one worker, one future
with ThreadPoolExecutor(max_workers=1) as executor:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, posts[0]))

#10 workers, multiple futures
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, p))

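Update:

I found a temporary workaround (which does not invalidate the original question): instantiate the webdriver outside the ProfileParser class. I don't know why this works when the original version does not. Some language detail, I suppose? Thanks for the answers, but the problem does not seem to be the ThreadPoolExecutor max_workers limit: as you can see in one of the options, I tried submitting a single instance and it still did not work.

Current workaround: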

futures = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        driver = Utils.init_chromedriver()
        futures.append({
            'future': executor.submit(collect_user_info, driver, acc, p),
            'driver': driver
        })

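# Leaving the executor's with-block waits for every task, so the futures are already finished here.
for f in futures:
    f['future'].done()  # only returns a bool; use .result() to surface exceptions
    Utils.shutdown_chromedriver(f['driver'])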
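
One possible explanation for why the workaround behaves differently, offered as an assumption rather than something confirmed in this thread: in the failing variants the inner with-block ends as soon as submit() returns, so __exit__ may shut the driver down before the worker thread has used it. The sketch below reproduces that ordering with hypothetical stand-ins (FakeDriver, Parser, delayed_collect and worker are illustrative names, not part of Selenium or the original code) and then shows a variant in which each task opens and closes its own parser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a webdriver; tracks whether it is still open."""

    def __init__(self):
        self.alive = True

    def get(self, url):
        if not self.alive:
            raise ConnectionError("driver already shut down")
        return f"loaded {url}"

    def quit(self):
        self.alive = False


class Parser:
    """Hypothetical context manager that owns one driver, like ProfileParser."""

    def __enter__(self):
        self.driver = FakeDriver()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()

    def collect(self, url):
        return self.driver.get(url)


urls = ["http://example.com/a", "http://example.com/b"]

# Pattern from the question: the inner with-block ends right after submit().
gate = threading.Event()  # holds the task back until the parser has been closed


def delayed_collect(parser, url):
    gate.wait()
    return parser.collect(url)


error = None
with ThreadPoolExecutor(max_workers=1) as executor:
    with Parser() as parser:
        future = executor.submit(delayed_collect, parser, urls[0])
    gate.set()  # __exit__ has already shut the driver down at this point
    try:
        future.result()
    except ConnectionError as exc:
        error = str(exc)


# Alternative: open and close the parser inside the task itself.
def worker(url):
    with Parser() as parser:
        return parser.collect(url)


with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(worker, urls))

print(error)
print(results)
```

The Event only makes the ordering deterministic for the demonstration; with a real browser the task is usually slow enough that the with-block exits first anyway.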

Best answer

This error message...

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

...appears to come from urllib3's connection pool, which raises these warnings while executing the def _put_conn(self, conn) method in connectionpool.py.

def _put_conn(self, conn):
    """
    Put a connection back into the pool.

    :param conn:
        Connection object for the current host and port as returned by
        :meth:`._new_conn` or :meth:`._get_conn`.

    If the pool is already full, the connection is closed and discarded
    because we exceeded maxsize. If connections are discarded frequently,
    then maxsize should be increased.

    If the pool is closed, then the connection will be closed and discarded.
    """
    try:
        self.pool.put(conn, block=False)
        return  # Everything is dandy, done.
    except AttributeError:
        # self.pool is None.
        pass
    except queue.Full:
        # This should never happen if self.block == True
        log.warning(
            "Connection pool is full, discarding connection: %s",
            self.host)

    # Connection never got put back into the pool, close it.
    if conn:
        conn.close()
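
The behaviour of _put_conn can be imitated with the standard library alone. The following is a simplified sketch, not urllib3 code: put_conn and simulate are hypothetical helpers, and the connections are plain strings. It shows that a bounded queue rejects every connection returned beyond maxsize, which is the situation the warning reports.

```python
import queue


def put_conn(pool, conn, discarded):
    """Simplified analogue of _put_conn: return conn to the pool or discard it."""
    try:
        pool.put(conn, block=False)
        return True
    except queue.Full:
        # urllib3 logs "Connection pool is full, discarding connection" here.
        discarded.append(conn)
        return False


def simulate(maxsize, returned):
    """Return `returned` connections to a pool of `maxsize` and count discards."""
    pool = queue.LifoQueue(maxsize)
    discarded = []
    for i in range(returned):
        put_conn(pool, f"conn-{i}", discarded)
    return len(discarded)


small_pool_discards = simulate(maxsize=1, returned=10)
large_pool_discards = simulate(maxsize=10, returned=10)
print(small_pool_discards, large_pool_discards)
```

A discarded connection is only closed, so the warning by itself is about wasted connections rather than failed requests.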

ThreadPoolExecutor

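ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. Deadlocks can occur when the callable associated with a Future waits on the result of another Future.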

class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())
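  • An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.
  • initializer is an optional callable that is called at the start of each worker thread; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenThreadPool, as will any attempt to submit more jobs to the pool.
  • Changed in version 3.5: If max_workers is None or not given, it defaults to the number of processors on the machine multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work, so the number of workers should be higher than the number of workers for ProcessPoolExecutor.
  • New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading.Thread names of worker threads created by the pool, for easier debugging.
  • Changed in version 3.7: Added the initializer and initargs arguments.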
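
The parameters listed above can be sketched as follows. This is a minimal illustration that needs Python 3.7 or later for initializer and initargs; init_worker, task and the "driver" label are hypothetical names. In the scenario of the question, the initializer would be one place to create a browser per worker thread, which is an assumption and not something the original code does.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage, so each worker keeps its own resource.
local = threading.local()


def init_worker(label):
    # Runs once in each worker thread when that thread starts.
    local.resource = f"{label}-{threading.current_thread().name}"


def task(n):
    # Every task sees the resource created by its own worker thread.
    return local.resource, n * n


with ThreadPoolExecutor(
    max_workers=4,
    thread_name_prefix="parser",
    initializer=init_worker,
    initargs=("driver",),
) as executor:
    results = list(executor.map(task, range(8)))

squares = [square for _, square in results]
resources = {resource for resource, _ in results}
print(squares)
print(sorted(resources))
```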

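As per your question, since you are trying to launch 10-20 instances, the default connection pool size of 10, which is hardcoded in adapters.py, does not seem to be enough in your case.

Moreover, @EdLeafe mentions in the discussion "Getting error: Connection pool is full, discarding connection":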

It looks like within the requests code, None objects are normal. If _get_conn() gets None from the pool, it simply creates a new connection. It seems odd, though, that it should start with all those None objects, and that _put_conn() isn't smart enough to replace None with the connection.

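However, the merge "Add pool size parameter to client constructor" has fixed this issue.

Solution

Increasing the default connection pool size of 10, which was previously hardcoded in adapters.py and is now configurable, should solve your problem.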
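
For code that builds its own urllib3 pool, the size is set through the maxsize argument. The snippet below is only a sketch of that setting: the host, the port and the value 20 are assumptions chosen for illustration, and whether a given Selenium version lets you pass such a setting through is not covered by this thread.

```python
import urllib3

# Assumed values: 20 matches the upper bound of browser instances mentioned
# in the question; host and port are placeholders.
pool = urllib3.HTTPConnectionPool(
    "127.0.0.1",
    port=9515,
    maxsize=20,  # keep up to 20 idle connections instead of discarding them
    block=True,  # wait for a free connection rather than opening extra ones
)

# Creating the pool opens no sockets; connections are made lazily on request.
configured_size = pool.pool.maxsize
print(configured_size)
```

With block=True the pool never holds more than maxsize connections, so the "pool is full" branch should not be reached, at the cost of requests waiting when all connections are busy.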


Update

As per your comment update, "...submitting a single instance gives the same result...". According to @meferguson84 in the discussion "Getting error: Connection pool is full, discarding connection":

I stepped into the code to the point where it mounts the adapter just to play with the pool size and see if it made a difference. What I found was that the queue is full of NoneType objects with the actual upload connection being the last item in the list. The list is 10 items long (which makes sense). What doesn't make sense is that the unfinished_tasks parameter for the pool is 11. How can this be when the queue itself is only 11 items? Also, is it normal for the queue to be full of NoneType objects with the connection we are using being the last item on the list?

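This sounds like a possible cause in your use case. It may sound redundant, but you can still try a few ad-hoc steps.

Regarding "python - Connection pool is full, discarding connection with ThreadPoolExecutor and multiple headless browsers through Selenium and Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53641068/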
