python - Problem with multiprocessing in Python using BeautifulSoup 4

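I want to use most or all of the cores to process files faster, either by reading several files at once or by using multiple cores to read a single file.

I would prefer to use multiple cores to read a single file before moving on to the next one.

I tried the code below, but it does not seem to use all of the cores.

The following code basically reads the *.txt files in a directory; each file contains HTML pages in JSON format.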

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import requests
    import json
    import urlparse
    import os
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool  # This is a thread-based Pool
    from multiprocessing import cpu_count

    def crawlTheHtml(htmlsource):
        htmlArray = json.loads(htmlsource)
        for eachHtml in htmlArray:
            soup = BeautifulSoup(eachHtml['result'], 'html.parser')
            if all(['another text to search' not in str(soup),
                   'text to search' not in str(soup)]):
                try:
                    gd_no = ''
                    try:
                        gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                    except:
                        pass

                    r = requests.post('domain api address', data={
                        'gd_no': gd_no,
                        })
                except:
                    pass


    if __name__ == '__main__':
        pool = Pool(cpu_count() * 2)
        print(cpu_count())
        fileArray = []
        for filename in os.listdir(os.getcwd()):
            if filename.endswith('.txt'):
                fileArray.append(filename)
        for file in fileArray:
            with open(file, 'r') as myfile:
                htmlsource = myfile.read()
                results = pool.map(crawlTheHtml(htmlsource), f)

Apart from that, I am not sure what the , f stands for.

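Question 1:

What am I doing wrong that keeps me from making full use of all the cores/threads?

Question 2:

Is there a better way to handle try: / except:? Sometimes the value is not on the page, which makes the script stop. When dealing with multiple variables, I end up with a lot of try and except statements.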

Best answer

To answer question 1, your problem is this line:

    from multiprocessing.dummy import Pool  # This is a thread-based Pool

The answer is taken from: multiprocessing.dummy in Python is not utilising 100% cpu

When you use multiprocessing.dummy, you are using threads, not processes:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
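One way to see the difference for yourself is to ask each worker for its process ID. The sketch below is only an illustration: the helper names worker_pid, thread_pool_pids and process_pool_pids are made up for this example and do not appear in the question's code. A thread-based pool should report a single PID (the parent's), while a process-based pool reports the PIDs of separate worker processes.

```python
import os
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool


def worker_pid(_):
    """Return the PID of whatever process runs this task."""
    return os.getpid()


def thread_pool_pids(n_workers=4, n_tasks=8):
    # Threads all live inside the calling process, so only one PID shows up.
    with ThreadPool(n_workers) as pool:
        return set(pool.map(worker_pid, range(n_tasks)))


def process_pool_pids(n_workers=4, n_tasks=8):
    # Each worker is a separate OS process with its own interpreter and GIL.
    with ProcessPool(n_workers) as pool:
        return set(pool.map(worker_pid, range(n_tasks)))


if __name__ == '__main__':
    print('parent pid:       ', os.getpid())
    print('thread pool pids: ', thread_pool_pids())
    print('process pool pids:', process_pool_pids())
```

The thread pool is still a reasonable choice when the work is mostly waiting on I/O, such as the requests.post call in the question, because the GIL is released while a thread waits on the network.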

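This means you are restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That keeps you from fully utilizing your CPUs. If you want full parallelism across all available cores, you will need to address the pickling issue you are hitting with multiprocessing.Pool.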
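A minimal sketch of what a process-based version could look like follows. It rests on a few assumptions. The names extract_gd_no, extract_all and _InputValueFinder are hypothetical. The standard library's html.parser stands in for BeautifulSoup so the sketch runs without third-party packages. Each JSON record is assumed to carry its HTML under the 'result' key, as in the question. The worker is defined at module level, which is what lets multiprocessing pickle a reference to it. The sketch also shows what the second argument of pool.map is: map takes the function itself and an iterable of inputs, rather than the result of calling the function.

```python
import json
from html.parser import HTMLParser
from multiprocessing import Pool, cpu_count


class _InputValueFinder(HTMLParser):
    """Remember the value attribute of <input id="GD_NO">, if present."""

    def __init__(self):
        super().__init__()
        self.value = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('id') == 'GD_NO':
            # A missing value attribute falls back to '' instead of raising.
            self.value = attrs.get('value') or ''


def extract_gd_no(record):
    """Worker: parse one record and return its GD_NO value ('' if absent).

    Defined at module level so multiprocessing can pickle a reference to it.
    """
    finder = _InputValueFinder()
    finder.feed(record['result'])
    return finder.value


def extract_all(htmlsource, workers=None):
    """Spread the records of one file across worker processes."""
    records = json.loads(htmlsource)
    with Pool(workers or cpu_count()) as pool:
        # map(function, iterable): one call to extract_gd_no per record.
        return pool.map(extract_gd_no, records)


if __name__ == '__main__':
    sample = json.dumps([
        {'result': '<form><input id="GD_NO" value="12345"></form>'},
        {'result': '<p>no such field here</p>'},
    ])
    print(extract_all(sample, workers=2))
```

Returning a default from the worker when the field is missing, rather than wrapping every lookup in try/except, also keeps one bad page from stopping the whole run. The if __name__ == '__main__': guard matters here because on platforms that start workers by spawning a fresh interpreter, the main module is imported again in each child.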

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/50178900/
