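I am trying to scrape a website whose login form contains a CSRF __RequestAccessToken. I can get the token value from the form, put it in a header, and POST it along with the cookies, but I get a 500 status code. result.text contains the messages "Sorry, an error occurred while processing your request." and "Our website uses features that are only available in modern browsers. For the best experience, we recommend upgrading your browser to one of these".
As I said, I can get the token value from the form and place it in the header. The cookies are also retrieved and sent with the POST. Logging in works when I enter my credentials manually in Chrome. I'm not sure what to try next. Can anyone suggest what is going wrong? Thanks in advance.
These are the headers Chrome shows when I log in manually: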
GET /Security/Register HTTP/1.1
'Host': 'www.idocmarket.com',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Sec-Fetch-Site': 'same-origin',
'Referer': 'https://www.idocmarket.com/',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cookie': '__utmz=141398122.1569340638.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASP.NET_SessionId=aow5a3q4o0kfdhwu554ma2qt; __utmc=141398122; __RequestVerificationToken=XsNRpnUzlge1NCeddExuVaN_uYheGBROrEHHNLgY5oTxc5HZqVZrXKmnn2IgUquL_tM-uWaebglLrfEpdGIutLYAFdK5EzQGOFeyiz3PszQ1; __utma=141398122.1343771318.1569340638.1570400178.1570490801.9; party_search_type=Contains; __utmb=141398122.19.10.1570490801'
Here is my code:
import requests
from datetime import datetime
from bs4 import BeautifulSoup

LOGIN_URL = 'https://www.idocmarket.com/Security/Register'
EMAIL = 'myemail@gmail.com'
PASSWORD = 'somepwd'
LOGIN_API_URL = 'https://www.idocmarket.com/Security/Register'


def main():
    # Persistent login session
    session_requests = requests.session()

    # Get login auth token
    result = session_requests.get(LOGIN_URL)
    cookies = result.cookies
    soup = BeautifulSoup(result.content, "html.parser")
    auth_token = soup.find("input", {'name': '__RequestVerificationToken'}).get('value')

    # Create payload
    payload = {
        "Login_Username": EMAIL,
        "Login_Password": PASSWORD
    }
    headerpayload = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Host': 'www.idocmarket.com',
        'Origin': 'https://www.idocmarket.com/Security/Register',
        'Referer': 'https://www.idocmarket.com/',
        '__RequestVerificationToken': auth_token
    }

    # Perform login
    result = session_requests.post(
        LOGIN_API_URL,
        data=payload,
        headers=headerpayload,
        cookies=cookies
    )

    # Report successful login
    print("Login succeeded: ", result.ok)
    print("Status code:", result.status_code)
    print(result.text)


# Entry point
if __name__ == '__main__':
    main()
I expect to be taken to the next page after logging in.
Best answer
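It looks like this site sets the XSRF token both in a cookie and in a request parameter (as you noticed). After some testing, the cookie alone appears to be enough: dropping the XSRF parameter from the POST request while still sending the correct cookies seems to work.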
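As a side note on where the parameter would go if it did turn out to be required: ASP.NET MVC's built-in anti-forgery check conventionally pairs the cookie with a form field of the same name, __RequestVerificationToken, rather than with a request header. The sketch below only builds and prints such a form-encoded body without sending anything. The helper name build_login_body and the token value are made up for illustration, and treating the token as a form field on this particular site is an assumption.

```python
from urllib.parse import urlencode

TOKEN_FIELD = "__RequestVerificationToken"


def build_login_body(username, password, token=None):
    """Return a form-encoded POST body for the login request.

    The credential field names are taken from the answer's code. Adding the
    token as a form field is an assumption, not something the site confirms.
    """
    fields = {"Login.Username": username, "Login.Password": password}
    if token is not None:
        # The token travels in the body next to the credentials, not in a header.
        fields[TOKEN_FIELD] = token
    return urlencode(fields)


# Placeholder token value, for illustration only.
body = build_login_body("myemail@gmail.com", "somepwd", token="example-token")
print(body)
```

With requests, the same effect comes from adding the token to the dict passed as data=, which is form-encoded in the same way.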
Using curl for a quick test:
username=myemail@gmail.com
password=somepwd
curl -s -c cookies.txt 'https://www.idocmarket.com/Security/LogOn'
curl -v -b cookies.txt -L 'https://www.idocmarket.com/Security/LogOn' \
-d "Login.Username=$username&Login.Password=$password"
And using Python:
import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://www.idocmarket.com/Security/LogOn'
EMAIL = 'myemail@gmail.com'
PASSWORD = 'somepwd'

s = requests.Session()

# The initial GET lets the session store the cookies the site sets
s.get(LOGIN_URL)

# The session sends the stored cookies back automatically with the POST
r = s.post(LOGIN_URL, data={
    "Login.Username": EMAIL,
    "Login.Password": PASSWORD
})
soup = BeautifulSoup(r.text, "html.parser")
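Two things differ between the question's code and the code above: the POST target (/Security/Register versus /Security/LogOn) and the field names (Login_Username versus Login.Username). A browser submits each input's name attribute, not its id, so one way to check both is to list the form's action and input names from the login page. The following is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup so it has no dependencies. The FormInspector class, the inspect_forms helper, and the sample markup are hypothetical and are not copied from the real page.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect each <form>'s action and the name/value of its <input> elements."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form: {"action": ..., "fields": {...}}
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action"), "fields": {}}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Key on the name attribute, because that is what gets submitted.
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def inspect_forms(html):
    """Return a list of {"action", "fields"} dicts, one per form in the HTML."""
    parser = FormInspector()
    parser.feed(html)
    return parser.forms


# Invented markup for illustration; the real login page may look different.
SAMPLE_HTML = """
<form action="/Security/LogOn" method="post">
  <input name="__RequestVerificationToken" type="hidden" value="example-token" />
  <input id="Login_Username" name="Login.Username" type="text" />
  <input id="Login_Password" name="Login.Password" type="password" />
</form>
"""

for form in inspect_forms(SAMPLE_HTML):
    print(form["action"], sorted(form["fields"]))
```

Against the live site, the text of the response from s.get(LOGIN_URL) would be passed to inspect_forms instead of the sample string.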
Regarding "python web scraping login with __RequestAccessToken not working", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58358410/