python - How to scrape a website with Sucuri protection

Tags: python scrapy cloudflare

Following the Scrapy documentation, I want to crawl and scrape data from several sites. My code works fine on ordinary websites, but when I try to scrape a site protected by Sucuri I get no data back at all; the Sucuri firewall seems to block me from ever reaching the page markup.

The target site is http://www.dwarozh.net/, and here is a snippet of my spider:

from scrapy import Spider
from scrapy.selector import Selector
import scrapy

from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item

This is the response I get instead:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>

How can I bypass Sucuri with Scrapy?

Best answer

The site uses cookie- and User-Agent-based protection, and you can check this yourself. Open DevTools in Chrome and navigate to the target page http://www.dwarozh.net/sport/. In the Network tab, right-click the request for the page, choose "Copy as cURL", then open a terminal and run the copied command:

$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed

You will see the normal HTML. If you strip the cookies or the User-Agent from the request, you get the challenge stub page shown above instead.
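The same check can be reproduced from Python with the requests library; this is only a minimal sketch, and the cookie and User-Agent values below are placeholders you would replace with the ones copied from your own browser session:

import requests

# Placeholder values: copy the real Cookie header and User-Agent from the
# "Copy as cURL" output of your own browser session.
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'),
    'Cookie': 'sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v',
}

resp = requests.get('http://www.dwarozh.net/sport/', headers=headers)
print(resp.status_code)
# With valid cookies the body contains the real markup; without them it is
# the "You are being redirected..." Sucuri challenge page instead.
print('news-more-img' in resp.text)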

Let's check the same thing in a Scrapy shell:

$ scrapy shell
>>> from scrapy import Request
>>> cookie_str = '''here; your; cookies; from; browser; go;'''
>>> cookies = dict(pair.split('=') for pair in cookie_str.split('; '))
>>> cookies  # check them
{'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49', 'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9', '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
>>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54. Safari/5'})
>>> fetch(r)
>>> response.xpath('//div[@class="news-more-img"]/ul/li')
[<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]
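One caveat with the parsing one-liner above: dict(pair.split('=') for pair in cookie_str.split('; ')) raises a ValueError as soon as a cookie value itself contains an '=' sign. A slightly more defensive sketch (my own variation, not required for the cookies shown here) splits each pair only once:

def parse_cookie_header(cookie_str):
    # Split pairs on '; ', but split name from value only on the first '=',
    # so values that themselves contain '=' survive intact.
    cookies = {}
    for pair in cookie_str.split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

print(parse_cookie_header('_ga=GA1.2.19688; sucuri_cloudproxy_uuid_0d5c=6ab0'))
# {'_ga': 'GA1.2.19688', 'sucuri_cloudproxy_uuid_0d5c': '6ab0'}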

Great! Let's turn this into a spider.

I've modified yours a bit, since I don't have the source for some of your components.

from scrapy import Spider, Request
from scrapy.selector import Selector
import scrapy

#from Stack.items import StackItem
#from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        cookies = dict(pair.split('=') for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}
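If you prefer not to attach the User-Agent to every Request by hand, Scrapy can apply it globally through the USER_AGENT setting; a minimal sketch, assuming the rest of the spider stays exactly as above:

from scrapy import Spider

class StackSpider(Spider):
    name = "dwarozh"
    # UserAgentMiddleware reads USER_AGENT, so start_requests() no longer has
    # to attach an explicit 'User-Agent' header to each Request.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) '
                       'AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'),
    }
    # start_urls, _cookie_str, start_requests() and parse() stay as in the
    # spider above (minus the headers= argument on each Request).

The cookies still have to be passed per request, because they are what Sucuri actually validates.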

Let's run it:

$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  return f(*args, **kwds)
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-10 00:18:55 [scrapy] INFO: Spider opened
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'}
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 950,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15121,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
 'item_scraped_count': 5,
 'log_count/DEBUG': 8,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
url,title
,"
لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە"
,"
هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید"
,"
گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا"
,"
بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە"
,"
كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە"

You may need to refresh the cookies from time to time; you can use PhantomJS for that.

Update:

How to get the cookies with PhantomJS:

  1. Install PhantomJS.

  2. Write a script like this and save it as dwarosh.js:

    var page = require('webpage').create();
    page.settings.userAgent = 'SpecialAgent';
    page.open('http://www.dwarozh.net/sport/', function(status) {
      console.log("Status: " + status);
      if (status === "success") {
        page.render('example.png');
        page.evaluate(function() {
          return document.title;
        });
      }
      for (var i = 0; i < page.cookies.length; i++) {
        var c = page.cookies[i];
        console.log(c.name, c.value);
      }
      phantom.exit();
    });
    
  3. Run the script:

      $ phantomjs --cookies-file=cookie.txt dwarosh.js
      TypeError: undefined is not an object (evaluating  'activeElement.position().left')
    
      http://www.dwarozh.net/sport/js/script.js:5
      https://code.jquery.com/jquery-1.10.2.min.js:4 in c
      https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
      https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
      https://code.jquery.com/jquery-1.10.2.min.js:4 in q
    Status: success
    __auc 250ab0a9158ee9e73eeeac78bba
    __asc 250ab0a9158ee9e73eeeac78bba
    _gat 1
    _ga GA1.2.260482211.1481472111
    ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
    sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
    __cfduid d9059962a4c12e0f....1
    
  4. Take the sucuri_cloudproxy_uuid_3e07984e4 cookie and try to fetch the page with curl, using the same User-Agent:

    $ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
    *   Trying 104.25.209.23...
    * Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
    > GET /sport/ HTTP/1.1
    > Host: www.dwarozh.net
    > User-Agent: SpecialAgent
    > Accept: */*
    > Cookie:     sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
    >
    < HTTP/1.1 200 OK
    < Date: Sun, 11 Dec 2016 16:17:04 GMT
    < Content-Type: text/html; charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
    < Cache-Control: private
    < Vary: Accept-Encoding
    < Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
    < X-AspNet-Version: 4.0.30319
    < X-XSS-Protection: 1; mode=block
    < X-Frame-Options: SAMEORIGIN
    < X-Content-Type-Options: nosniff
    < X-Sucuri-ID: 15008
    < Server: cloudflare-nginx
    < CF-RAY: 30fa3ea1335237b0-ARN
    <
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
    Dwarozh : Sport
    </title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
    <script src="js/jquery-2.1.1.js" type="text/javascript"></script>
    <script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
    <script src="js/script.js" type="text/javascript"></script>
    <link href="css/styles.css" rel="stylesheet"/>
    <script src="js/classie.js" type="text/javascript"></script>
    <script type="text/javascript">
    
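To avoid pasting the cookies into the spider by hand every time they expire, the PhantomJS step can also be driven from Python. This is only a rough sketch: it assumes phantomjs is on the PATH and that dwarosh.js prints one "name value" pair per line, exactly as in the output above.

import subprocess

def fetch_sucuri_cookies(script='dwarosh.js'):
    # Run the PhantomJS script above and collect the "name value" lines it
    # prints into a dict; lines such as "Status: success" or stack traces
    # carry a ':' in their first token and are skipped.
    out = subprocess.check_output(['phantomjs', '--cookies-file=cookie.txt', script])
    cookies = {}
    for raw in out.decode('utf-8').splitlines():
        line = raw.strip()
        parts = line.split(' ', 1)
        if len(parts) == 2 and ':' not in parts[0]:
            cookies[parts[0]] = parts[1]
    return cookies

In start_requests() you could then call fetch_sucuri_cookies() instead of parsing the hard-coded _cookie_str.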

Regarding python - how to scrape a website with Sucuri protection, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41005076/
