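python - How to handle an unexpected JSON response in a while loop

I am building a Python script that collects data from Instagram for a list of users stored in my database. However, I am running into problems handling an unexpected JSON response.

For context: the program fetches a username from my database table (24/7, cycling through hundreds of accounts, hence the while True: loop), requests the URL for that username, and expects a certain JSON response (specifically, it looks for ['entry_data']['ProfilePage'][0] in the response). When the username cannot be found on Instagram, however, the JSON is different and the expected part (['entry_data']['ProfilePage'][0]) is missing, so my script crashes.

With the current code: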

def get_username_from_db():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM ig_users_raw WHERE `username` IS NOT NULL ORDER BY `ig_users_raw`.`last_checked` ASC LIMIT 1")
            row = cursor.fetchall()
            username = row[0]['username']
    except pymysql.IntegrityError:
        print('ERROR: ID already exists in PRIMARY KEY column')
    return username

def request_url(url):
    try:
        response = requests.get(url)
    except requests.HTTPError:
        raise requests.HTTPError(f'Received non 200 status code from {url}')
    except requests.RequestException:
        raise requests.RequestException
    else:
        return response.text

def extract_json_data(url):
    try:
        r = requests.get(url, headers=headers)
    except requests.HTTPError:
        raise requests.HTTPError('Received non-200 status code.')
    except requests.RequestException:
        raise requests.RequestException
    else:
        print(url)
        soup = BeautifulSoup(r.content, "html.parser")
        scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
        stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
        j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        return j

if __name__ == '__main__':
    while True:
        sleep(randint(5,15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        j = extract_json_data(url)
        json_string = json.dumps(j)
        user_id = j['graphql']['user']['id']
        username = j['graphql']['user']['username']
        #print(user_id)
        try:
            with connection.cursor() as cursor:
                db_data = (json_string, datetime.datetime.now(),user_id)
                sql = "UPDATE `ig_users_raw` SET json=%s, last_checked=%s WHERE `user_id`= %s "
                cursor.execute(sql, db_data)
                connection.commit()
                print(f'{datetime.datetime.now()} - data inserted for user: {user_id} - {username}')
        except pymysql.Error as exc:
            print('ERROR: ', exc)

I get the following error/traceback:

https://www.instagram.com/geloria.itunes/
Traceback (most recent call last):
  File "D:\Python\Ministry\ig_raw.py", line 63, in <module>
    j = extract_json_data(url)
  File "D:\Python\Ministry\ig_raw.py", line 55, in extract_json_data
    j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

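Ideally, I would like it to skip the account (geloria.itunes in this case) and move on to the next account in the database. I might want to remove the account, or at least remove the username from the row.

To solve this myself I experimented with if/else statements, but whenever the script continued, it just kept looping on the same account.

Do you have any suggestions on how I can tackle this specific issue?

Thanks!

Best answer

First of all, you need to figure out why the exception occurred.

You get this error because you are asking json to parse an invalid (non-JSON) string.

Just run this example with the URL from the traceback you provided: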

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.instagram.com/geloria.itunes/")
print(r.status_code)  # outputs 404(!)

soup = BeautifulSoup(r.content, "html.parser")
scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]

print(stringified_json)
# j = json.loads(stringified_json)  # will raise an exception

Output:

\n(function(){\n function normalizeError(err) {\n... ... stringify(normalizedError));\n })\n }\n })\n}());

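As you can see, stringified_json is not a valid JSON string.

As you mentioned, it is invalid because this Instagram page is hidden or does not exist (the HTTP status code is 404 Not Found). You are passing the wrong response to json.loads() because your script never checks the response status code.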
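
Independently of the status code, the parsing step itself can be made defensive. The following is a minimal sketch, not part of the original script: parse_profile and UnexpectedPayloadError are hypothetical names introduced here for illustration. It turns both a non-JSON string and a missing ['entry_data']['ProfilePage'][0] into a single exception type that the main loop could catch.

```python
import json


class UnexpectedPayloadError(Exception):
    """Hypothetical exception: the page did not contain the expected profile JSON."""


def parse_profile(stringified_json):
    """Return ['entry_data']['ProfilePage'][0] or raise UnexpectedPayloadError."""
    try:
        data = json.loads(stringified_json)
        return data['entry_data']['ProfilePage'][0]
    except json.JSONDecodeError as exc:
        # The string was not JSON at all (e.g. the JavaScript served on a 404 page).
        raise UnexpectedPayloadError(f'not valid JSON: {exc}') from exc
    except (KeyError, IndexError, TypeError) as exc:
        # The JSON did not have the expected profile section.
        raise UnexpectedPayloadError(f'profile data missing: {exc!r}') from exc
```

Wrapping the three lookups in one function keeps the while loop free of nested try blocks: it only needs to catch one exception type and move on.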

The following except clauses do not catch the "404 case" because you received a valid HTTP response, so no exception is raised:

except requests.HTTPError:
    raise requests.HTTPError('Received non-200 status code.')
except requests.RequestException:
    raise requests.RequestException

So basically you have two ways to deal with this:

check the response HTTP status code manually, e.g. if r.status_code != 200 ...
or use the raise_for_status() method, which raises an exception if 400 <= r.status_code < 600
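
The two options can be sketched roughly as follows. This is only an illustration under assumptions: ensure_ok_manual and ensure_ok_builtin are hypothetical helper names, and the response argument is assumed to behave like the object returned by requests.get() (a status_code attribute and a raise_for_status() method).

```python
def ensure_ok_manual(response):
    """Option 1: check the status code by hand."""
    if response.status_code != 200:
        # Anything other than 200 is rejected, including 3xx and other 2xx codes.
        raise RuntimeError(f'unexpected status code {response.status_code}')
    return response


def ensure_ok_builtin(response):
    """Option 2: let the response object decide.

    With requests, raise_for_status() raises requests.HTTPError
    when 400 <= status_code < 600 and does nothing otherwise.
    """
    response.raise_for_status()
    return response
```

Note that the two checks are not equivalent: the manual comparison is stricter, while raise_for_status() only reacts to 4xx and 5xx responses. Either one would be called right after requests.get(), before the body is handed to BeautifulSoup.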

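"I might want to remove the account, or at least remove the username from the row."

Well, this part of your question is a bit vague, so I can only offer an idea.

For example, if you encounter a 404 page, you can raise a custom exception while processing the response, catch it later in __main__, delete the record from the database, and continue with the other pages: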

class NotFoundError(Exception):
    """ my custom exception for not found pages """
    pass

...  # other functions

def extract_json_data(url):
    r = requests.get(url, headers=headers)
    if r.status_code == 404:
        raise NotFoundError()  # page not found

    # if any other error occurs (network unavailable for example) - an exception will be raised

    soup = BeautifulSoup(r.content, "html.parser")
    scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
    stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
    return json.loads(stringified_json)['entry_data']['ProfilePage'][0]

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        try:
            j = extract_json_data(url)
        except NotFoundError:
            delete_user_from_db(username)  # implement: DELETE FROM t WHERE username = ...
            continue  # proceed for next user page

        # rest of your code:
        # json_string = json.dumps(j)
        # user_id = j['graphql']['user']['id']
        # ...
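
The delete step matters because get_username_from_db() always selects the row with the oldest last_checked: a failing row that is neither deleted nor updated is selected again on the next iteration, which is why a plain continue keeps looping on the same account. One possible shape for the helper is sketched below. It is an assumption, not the author's implementation: it takes the connection as an explicit first argument (unlike the one-argument call above, which would rely on the module-level connection), uses the ig_users_raw table from the question, and expects a pymysql-style connection with %s placeholders.

```python
def delete_user_from_db(connection, username):
    """Delete the row for `username`, commit, and return the number of rows removed.

    `connection` is assumed to be a pymysql-style DB-API connection.
    """
    sql = "DELETE FROM `ig_users_raw` WHERE `username` = %s"
    with connection.cursor() as cursor:
        # Pass the value as a parameter so the driver handles escaping.
        cursor.execute(sql, (username,))
        deleted = cursor.rowcount
    connection.commit()
    return deleted
```

If the row should be kept, a similar UPDATE that sets last_checked to the current time would also move the account to the back of the queue.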

Regarding "python - How to handle an unexpected JSON response in a while loop", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56811776/
