python - My child processes crash silently with no error message even though I handle their exceptions

Tags: python linux python-3.x parallel-processing multiprocessing

I wrote a program that scrapes a website, and since I have a lot of links to crawl, I use Python multiprocessing. Everything is fine when the program starts and my exception logging works well, but after 2-3 hours, 2-3 (or all 4) of my child processes sit at 0% CPU, and because I am not using anything asynchronous, the last line of my program, which logs the string "Done!", never executes! In the target function of my process pool I wrapped all the code in a try/except statement so my processes should not crash, and if one did crash I would expect to see some output in nohup.log (I run the script in the background with nohup python myscript.py &). I have no idea what is going on, and it is really driving me crazy.

I searched the internet and saw someone suggest using my_pool.close() after the pool statement (he said child processes do not necessarily shut down after finishing their tasks), but that did not help either :(
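For reference, a minimal sketch of that suggested Pool lifecycle (close() followed by join()), with a stand-in worker instead of my real media_crawler:

```python
import multiprocessing

def work(n):
    # stand-in for the real worker (media_crawler in my script)
    return n * n

if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    results = pool.map(work, range(10))  # blocks until all tasks finish
    pool.close()  # tell the pool no more tasks are coming
    pool.join()   # wait until every worker process has exited
    print(results)  # -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Note that map() itself already blocks until every task returns, so close()/join() here only ensures a clean shutdown of the worker processes.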

My code is about 200 lines, so I cannot post all of it here! Here is a summary; if you need details on any part, let me know.

from bs4 import BeautifulSoup
import requests
import urllib.request
import multiprocessing
from orator import DatabaseManager
import os
from datetime import datetime

def login():
    requests_session = requests.session()
    login_page = requests_session.get("https://www.example.com/login")
    payload = {
        "username": "XX",
        "password": "X",
    }
    response = requests_session.post("https://www.example.com/auth/eb-login", data=payload, headers=dict(referer="https://www.example.com/login"))
    if response.status_code == 200:
        return requests_session
    else:
        return False



def media_crawler(url_article_id):
    try:
        url = url_article_id[0] + "/images-videos"
        article_id = url_article_id[1]
        requests_session = url_article_id[2]
        db = DatabaseManager(config)
        page = requests_session.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        img_wrapper_list = soup.select("div.example")
        #Check if we are logged in
        if soup.select_one("div.example").text.strip().lower() != "logout":
        #if we are not, we log in again
            current_session = login()
            #if our login failed, we log it and stop processing this url
            if current_session == False:
                log = open("media.log", "a+")
                log.write(datetime.now().strftime('%H:%M:%S') + " We are not logged in and can not log in!: " 
                    "\nArticle ID: " + str(article_id)
                    + "\n----------------------------\n"
                )
                log.close()
                print("Error logged!")
                return
            #otherwise we use the new session
            else:
                requests_session = current_session
        #we go in every image wrapper and take all the images
        for img_wrapper in img_wrapper_list:
            if not img_wrapper.has_attr("data-jw"):
                img_source = img_wrapper.select_one("div.image-wrapper.mg > img")["src"]
                image_title = img_wrapper.select_one("div.image-wrapper.mg > img")["alt"]
                file_name_with_extension = img_source.split("/")[-1]
                file_name = file_name_with_extension.split(".")[0]
                file_extension = file_name_with_extension.split(".")[-1]
                try:
                    filename, headers = urllib.request.urlretrieve(img_source, "images/" + str(article_id) + "-" + file_name + "." + file_extension)
                    file_size = int(headers["Content-Length"]) / 1024
                    #Store the file in database
                #if we got any problem in downloading and storing in
                #database we log it and delete the downloaded file(if it downloaded)
                except Exception as e:
                    log = open("media.log", "a+")
                    log.write(datetime.now().strftime('%H:%M:%S') + " Problem in fetching media: \nURL: " 
                        + img_source + "\nArticle ID: " + str(article_id) + "\n" + str(e)
                        + "\n----------------------------\n"
                    )
                    log.close()
                    print("Error logged!")
                    try:
                        os.remove("images/" + str(article_id) + "-" + file_name + "." + file_extension)
                    except:
                        pass
        #Update the article record so we know which articles' media we downloaded
        try:
            db.table("articles").where('article_id', article_id).update(image_status=1)
        except Exception as e:
            log = open("media.log", "a+")
            log.write(datetime.now().strftime('%H:%M:%S') + " Problem in updating database record for: " 
                + "\nArticle ID: " + str(article_id) + "\n" + str(e)
                + "\n----------------------------\n"
            )
            log.close()
            print("Error logged!")
    #this is the try/except wrapper for my whole function
    except Exception as e:
        log = open("media.log", "a+")
        log.write(datetime.now().strftime('%H:%M:%S') + " Problem in this article media: \nURL: " 
            + "\nArticle ID: " + str(article_id) + "\n" + str(e)
            + "\n----------------------------\n"
        )
        log.close()
        print("Error logged!")
    db.disconnect()

db = DatabaseManager()

current_session = login()

if current_session:
    log = open("media.log", "w+")
    log.write("Start!\n")
    log.close()

    articles = db.table("articles").skip(0).take(1000).get()
    url_article_id_tuples_list = []
    for article in articles:
        temp = (article["article_link"], article["article_id"], current_session)
        url_article_id_tuples_list.append(temp)

    myPool = multiprocessing.Pool()
    myPool.map(media_crawler, url_article_id_tuples_list)

    myPool.close()

    log = open("media.log", "a+")
    log.write("\nDone!")
    log.close()

else:
    print("Can not login to the site!")

db.disconnect() 

After 2-3 hours my processes crash (I think), their CPU usage drops to 0%, and my last statement never executes:

log.write("\nDone!")

And I don't see anything unusual in my nohup.log or media.log. I really have no idea what is happening behind the scenes.
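One way to see where things stall is to consume results one at a time with imap_unordered and log each completion, instead of a single blocking map() call. A sketch, again with a stand-in worker rather than the real media_crawler:

```python
import multiprocessing

def work(n):
    # stand-in for media_crawler; returns its input so it can be logged
    return n

if __name__ == "__main__":
    tasks = list(range(10))
    pool = multiprocessing.Pool(4)
    done = 0
    # imap_unordered yields each result as soon as its task finishes,
    # so if the progress counter stops advancing, the still-missing
    # tasks are the ones that hung
    for _ in pool.imap_unordered(work, tasks):
        done += 1
        print("finished %d/%d" % (done, len(tasks)))
    pool.close()
    pool.join()
```

With map() the main process blocks silently until every task returns, so a single hung worker makes the whole run look frozen with no output at all.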

The only errors in my log file are about connections, and I already handle those :(

Start!
03:20:31 Problem in this article media: 
URL: 
Article ID: 190830
'alt'
----------------------------
03:50:05 Problem in fetching media: 
URL: https://cdn.example.com/30/91430-004-828719A3.jpg
Article ID: 188625
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:15:44 Problem in fetching media: 
URL: https://cdn.example.com/15/37715-004-AA71C615.jpg
Article ID: 241940
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:23:07 Problem in this article media: 
URL: 
Article ID: 244457
HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: /biography/Dore-Schary/images-videos (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
----------------------------
06:25:14 Problem in this article media: 
URL: 
Article ID: 248185
('Connection aborted.', OSError("(104, 'ECONNRESET')"))
----------------------------
06:28:30 Problem in fetching media: 
URL: https://cdn.example.com/89/77189-004-9D4A3E0B.jpg
Article ID: 244500
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:39:29 Problem in fetching media: 
URL: https://cdn.example.com/50/175050-004-8ACF8167.jpg
Article ID: 244763
Remote end closed connection without response
----------------------------
06:39:39 Problem in fetching media: 
URL: https://cdn.example.com/34/201734-004-D8779144.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:39:49 Problem in fetching media: 
URL: https://cdn.example.com/60/93460-004-B2993A85.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:39:59 Problem in fetching media: 
URL: https://cdn.example.com/03/174803-004-DE7B5599.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:40:09 Problem in fetching media: 
URL: https://cdn.example.com/81/188981-004-75AB37F3.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:42:42 Problem in this article media: 
URL: 
Article ID: 248524
HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: /topic/The-Yearling-novel-by-Rawlings/images-videos (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')")))

And here are my stalled processes (they are not exactly at 0%, but no new media gets added over time...):

xxxxx   26137  0.1  1.6 589696 134320 ?       Sl   May07   1:45 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx   26140  0.3  1.4 379392 120064 ?       SN   May07   4:52 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx   26141  0.5  1.4 380724 121172 ?       S    May07   8:55 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx   26142  0.7  1.5 382860 123112 ?       S    May07  10:37 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx   26143  0.4  1.4 379912 120380 ?       S    May07   6:15 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx   29324  0.0  0.0  21536  1032 pts/1    S+   04:20   0:00 grep --color=auto MediaCrawler.py

Accepted answer

Thanks for the comments. I ran some experiments, and here is what I found:

As Sam Mason said, the request rate to the site was too high. I solved the problem by waiting 1 second before each request, and now the program runs to completion.
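In code, the fix amounts to throttling each download. I also set a process-wide socket timeout, since urllib.request.urlretrieve takes no timeout argument and a stalled download would otherwise block a worker forever at 0% CPU; the 30-second value and the throttled_retrieve helper are my own choices, not something the site requires:

```python
import socket
import time
import urllib.request

# urlretrieve has no timeout parameter, so a stalled server leaves a
# worker blocked indefinitely; a default socket timeout turns that hang
# into an exception that the existing try/except logging can catch
socket.setdefaulttimeout(30)

def throttled_retrieve(url, filename, delay=1.0):
    # wait ~1 second between requests, then download with the
    # 30-second socket timeout in effect
    time.sleep(delay)
    return urllib.request.urlretrieve(url, filename)
```

With the timeout in place, a hung connection shows up in media.log as a normal urlopen error instead of a silently frozen process.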

A similar question about "python - My child processes crash silently with no error message even though I handle their exceptions" can be found on Stack Overflow: https://stackoverflow.com/questions/56036505/
