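I'm building a Python script that collects data from Instagram based on a list of users stored in my database. However, I'm running into problems handling an unexpected JSON response.
For context: the program fetches a username from my database table (24/7, cycling through hundreds of accounts, hence the while True: loop), requests the URL for that username, and expects a certain JSON response (specifically, it looks for ['entry_data']['ProfilePage'][0] in the response).
However, when the username can't be found on Instagram, the JSON is different and the expected part (['entry_data']['ProfilePage'][0]) is missing, so my script crashes.
The current code: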
def get_username_from_db():
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM ig_users_raw WHERE `username` IS NOT NULL ORDER BY `ig_users_raw`.`last_checked` ASC LIMIT 1")
            row = cursor.fetchall()
            username = row[0]['username']
    except pymysql.IntegrityError:
        print('ERROR: ID already exists in PRIMARY KEY column')
    return username

def request_url(url):
    try:
        response = requests.get(url)
    except requests.HTTPError:
        raise requests.HTTPError(f'Received non 200 status code from {url}')
    except requests.RequestException:
        raise requests.RequestException
    else:
        return response.text

def extract_json_data(url):
    try:
        r = requests.get(url, headers=headers)
    except requests.HTTPError:
        raise requests.HTTPError('Received non-200 status code.')
    except requests.RequestException:
        raise requests.RequestException
    else:
        print(url)
        soup = BeautifulSoup(r.content, "html.parser")
        scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
        stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
        j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
        return j

if __name__ == '__main__':
    while True:
        sleep(randint(5,15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        j = extract_json_data(url)
        json_string = json.dumps(j)
        user_id = j['graphql']['user']['id']
        username = j['graphql']['user']['username']
        #print(user_id)
        try:
            with connection.cursor() as cursor:
                db_data = (json_string, datetime.datetime.now(),user_id)
                sql = "UPDATE `ig_users_raw` SET json=%s, last_checked=%s WHERE `user_id`= %s "
                cursor.execute(sql, db_data)
                connection.commit()
                print(f'{datetime.datetime.now()} - data inserted for user: {user_id} - {username}')
        except pymysql.Error:
            print('ERROR: ', pymysql.Error)
I get the following error/traceback:
https://www.instagram.com/geloria.itunes/
Traceback (most recent call last):
  File "D:\Python\Ministry\ig_raw.py", line 63, in <module>
    j = extract_json_data(url)
  File "D:\Python\Ministry\ig_raw.py", line 55, in extract_json_data
    j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\thoma\AppData\Local\Programs\Python\Python36-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
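Ideally, I'd like it to skip the account (geloria.itunes in this case) and move on to the next account in the database. I might want to remove the account, or at least remove the username from the row.
To solve this myself, I experimented with if/else statements, but whenever the loop continued I just kept looping over the same account.
Do you have any suggestions on how I can tackle this specific issue?
Thanks!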
Best answer
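First of all, you need to figure out why the exception occurs.
You get this error because you are telling json to parse an invalid (non-JSON) string.
Just run this example with the URL you provided in the traceback: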
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.instagram.com/geloria.itunes/")
print(r.status_code) # outputs 404(!)
soup = BeautifulSoup(r.content, "html.parser")
scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
print(stringified_json)
# j = json.loads(stringified_json) # will raise an exception
Output:
\n(function(){\n function normalizeError(err) {\n... ... stringify(normalizedError));\n })\n }\n })\n}());
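As you can see, stringified_json is not a valid JSON string.
As you mentioned, it is invalid because this Instagram page is hidden or does not exist (the HTTP status code is 404 Not Found). And you are passing the wrong response to json.loads() because your script never checks the response status code.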
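As a complementary guard, the parsing step itself can be made tolerant of payloads that are not JSON or that lack the expected keys. The following is only a minimal sketch: parse_profile is a hypothetical helper name, not part of the original script, and it assumes the text extracted from the script tag is passed in as a string.

```python
import json


def parse_profile(stringified_json):
    """Return the ProfilePage entry, or None if the payload is not what we expect."""
    try:
        data = json.loads(stringified_json)
    except json.JSONDecodeError:
        # The text was not JSON at all (for example the error-page JavaScript above).
        return None
    try:
        return data['entry_data']['ProfilePage'][0]
    except (KeyError, IndexError, TypeError):
        # Valid JSON, but without the expected structure.
        return None
```

The caller can then test the result for None and skip the account instead of crashing. This does not replace checking the status code; it only keeps one malformed page from stopping the loop.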
The following except clauses do not catch the "404 case", because you received a valid HTTP response, so no exception is raised:
except requests.HTTPError:
    raise requests.HTTPError('Received non-200 status code.')
except requests.RequestException:
    raise requests.RequestException
So basically you have two ways to deal with this issue:
- check the response HTTP status code manually, e.g. if r.status_code != 200 ...
- or use the raise_for_status() method, which throws an exception if 400 <= r.status_code < 600
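The first option can be sketched roughly as follows. ensure_ok and the use of RuntimeError are assumptions made for illustration, and SimpleNamespace objects stand in for real responses so the snippet runs without network access. For the second option, calling r.raise_for_status() right after requests.get() makes requests raise requests.HTTPError itself for 4xx and 5xx responses.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Manual check: raise unless the response has status code 200."""
    if response.status_code != 200:
        raise RuntimeError(f'Unexpected HTTP status: {response.status_code}')
    return response


# Stand-in objects for illustration; a real requests.Response also has .status_code.
ok = SimpleNamespace(status_code=200)
missing = SimpleNamespace(status_code=404)

assert ensure_ok(ok) is ok
try:
    ensure_ok(missing)
except RuntimeError as exc:
    print(exc)  # Unexpected HTTP status: 404
```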
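You wrote: "I might want to remove the account, or at least remove the username from the row."
Well, that part of your question sounds a bit vague, so I can only offer an idea.
For example, if a 404 page is encountered, you can raise a custom exception while processing the response, catch it later in __main__, delete the record from the database and continue with the other pages: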
class NotFoundError(Exception):
    """ my custom exception for not found pages """
    pass

... # other functions

def extract_json_data(url):
    r = requests.get(url, headers=headers)
    if r.status_code == 404:
        raise NotFoundError() # page not found
    # if any other error occurs (network unavailable for example) - an exception will be raised
    soup = BeautifulSoup(r.content, "html.parser")
    scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
    stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
    return json.loads(stringified_json)['entry_data']['ProfilePage'][0]

if __name__ == '__main__':
    while True:
        sleep(randint(5, 15))
        username = get_username_from_db()
        url = f'https://www.instagram.com/{username}/'
        try:
            j = extract_json_data(url)
        except NotFoundError:
            delete_user_from_db(username) # implement: DELETE FROM t WHERE username = ...
            continue # proceed for next user page
        # rest of your code:
        # json_string = json.dumps(j)
        # user_id = j['graphql']['user']['id']
        # ...
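One possible shape for delete_user_from_db is sketched below. The query in the question always selects the row with the oldest last_checked, so a missing account keeps coming back until its row is either deleted or given a fresh last_checked; the sketch shows both. It rests on several assumptions: sqlite3 is swapped in for pymysql so the example runs without a MySQL server (with pymysql the placeholder would be %s instead of ?), the connection is passed in explicitly instead of being a global, mark_user_checked is a hypothetical helper, the table is reduced to two columns, and some_user is a made-up username.

```python
import datetime
import sqlite3


def delete_user_from_db(connection, username):
    """Delete the row for `username`; return the number of rows removed."""
    cursor = connection.execute(
        "DELETE FROM ig_users_raw WHERE username = ?", (username,)
    )
    connection.commit()
    return cursor.rowcount


def mark_user_checked(connection, username):
    """Alternative to deleting: refresh last_checked so the row moves to the back of the queue."""
    now = datetime.datetime.now().isoformat()
    cursor = connection.execute(
        "UPDATE ig_users_raw SET last_checked = ? WHERE username = ?",
        (now, username),
    )
    connection.commit()
    return cursor.rowcount


# Demo against an in-memory database (schema reduced to the two columns used here).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE ig_users_raw (username TEXT, last_checked TEXT)")
connection.executemany(
    "INSERT INTO ig_users_raw VALUES (?, ?)",
    [("geloria.itunes", "2019-01-01T00:00:00"), ("some_user", "2019-01-02T00:00:00")],
)
delete_user_from_db(connection, "geloria.itunes")
```

Either function can be called in the except NotFoundError branch; which one fits depends on whether the account should be dropped for good or simply retried later.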
Source: the Stack Overflow question "python - How to handle unexpected JSON responses in a while loop": https://stackoverflow.com/questions/56811776/