Python 的请求会触发 Cloudflare 的安全性,而 urllib 不会

标签 python python-3.x web-scraping python-requests

我正在为一家餐厅网站开发自动网络爬虫,但遇到了问题。所述网站使用 cloudlfare 的反机器人安全,我想绕过它,不是攻击模式,而是验证码测试,只有在检测到非美国 IP 或机器人时才会触发。我试图绕过它,因为当我清除 cookie、禁用 javascript 或当我使用美国代理时,cloudflare 的安全性不会触发。
知道了这一点,我尝试使用 python 的请求库:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)
但这最终会触发 Cloudflare,无论我使用什么代理。
然而使用具有相同 header 的 urllib.request 时:
import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))
当使用相同的美国 IP 运行时,这次它不会触发 Cloudflare 的安全性,即使它使用与请求库相同的 header 和 IP。
因此,我试图找出不在 urllib 库中的请求库中究竟是什么触发了 cloudflare。
虽然典型的答案是“然后只使用 urllib”,但我想弄清楚请求到底有什么不同,以及我如何解决它,首先要了解请求的工作原理和 cloudflare 检测机器人的方式,但也是如此我可以将我能找到的任何修复应用到其他 httplib(特别是异步的)
编辑 N°2:到目前为止的进展:
感谢@TuanGeek,我们现在可以使用请求绕过 cloudflare 块,只要我们直接连接到主机 IP 而不是域名(出于某种原因,请求的 DNS 重定向会触发 cloudflare,但 urllib 不会):
import requests
from collections import OrderedDict
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text
注意:尝试通过 http(而不是 https 且验证变量设置为 False)访问将触发 cloudflare 的块
现在这很棒,但不幸的是,我的最终目标仍然没有实现,即使用 httplib HTTPX 异步完成这项工作,因为使用以下代码,即使我们直接通过主机 IP 连接,仍会触发 cloudflare 块,使用正确的标题,并将验证设置为 False:
import trio
import httpx
import socket
from collections import OrderedDict
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
async def asks_worker():
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)
async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)
trio.run(run_task)
编辑 N°1:有关其他详细信息,这里是来自 urllib 和来自请求的原始 http 请求
要求:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR
网址库:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js  https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR

最佳答案

这真的激起了我的兴趣。 requests我能够开始工作的解决方案。
解决方案
最后缩小问题范围。当您使用请求时,它使用 urllib3 连接池。常规 urllib3 连接和连接池之间似乎存在一些不一致。一个有效的解决方案:

import requests
from collections import OrderedDict
from requests import Session
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
技术背景
所以我通过 Burp Suite 运行这两种方法来比较请求。以下是请求的原始转储
使用请求
GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1


使用 urllib
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1


不同之处在于标题的顺序。 dnt的区别大写实际上不是问题。
因此,我能够通过以下原始请求成功提出请求:
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0


所以Host header 已在上面发送 User-Agent .所以如果你想继续使用请求。考虑使用 OrderedDict 来确保标题的顺序。

关于Python 的请求会触发 Cloudflare 的安全性,而 urllib 不会,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/62684468/

相关文章:

java - Python 2.7 selenium webdriver 无法读取 java 站点上的表内容

internationalization - 使用 PyQt4 进行国际化的最佳实践

python-3.x - 如何在两个矩阵的比较中允许任何列的符号差异,Python?

python - 使用 python 隐藏部分的网页抓取表

python - 确保 Python 计算器万无一失

python - {m,n} 吗?正则表达式实际上最大限度地减少了重复,还是最大限度地减少了匹配的字符数?

python - Python/Chrome/Java (linux mint) 的段错误

Python 内存不足(使用后缀树)

python - 在不抑制和操纵 pylint 设置的情况下控制 "too many local variable in a function"的最佳做法是什么?

javascript - 需要 JavaScript 支持的页面上的 cURL 请求